Introduction¶

Machine learning (ML), data science (DS) and artificial intelligence (AI) are closely related disciplines.

  • Machine learning focuses on creating algorithms that allow computers to learn.
  • Artificial intelligence encompasses ML and other technologies (e.g. heuristic search algorithms, robotics) aimed at creating intelligent machines.
  • Data science is a broader field that uses ML and other techniques (e.g. data cleaning, data management) to analyze data and make informed decisions.

A down-to-earth definition of machine learning

  • Imagine a table filled with numbers!
  • Machine learning is discovering the relationship between columns of the table.
    • In supervised learning, there is a target column that we aim to estimate based on the other columns.
    • In unsupervised learning, there is no designated target column.
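This "table" view can be made concrete with a tiny made-up example (the numbers and column meanings below are purely illustrative):

```python
import numpy as np

# A small "table filled with numbers": each row is an example,
# each column is a feature (say: height, weight, age).
table = np.array([
    [188.0, 82.0, 24.0],
    [183.0, 95.0, 31.0],
    [190.0, 93.0, 27.0],
])

# Supervised learning: designate one column as the target...
X = table[:, :-1]   # input columns
y = table[:, -1]    # target column

# ...and learn to estimate y from X. In unsupervised learning,
# we would analyze the whole table without singling out a target.
print(X.shape, y.shape)  # (3, 2) (3,)
```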

A taxonomy of machine learning problems

  • Some examples of binary classification problems:
    • Predicting whether an email is spam or not, based on its content, sender, subject, etc.
    • Predicting whether a patient has a certain disease based on their symptoms, test results, medical history, etc.

Traditional machine learning vs. deep neural networks

Machine learning methods can be broadly categorized into two groups: traditional machine learning and deep neural networks. Traditional machine learning includes linear regression, logistic regression, support vector machines, decision trees, random forests, and gradient boosting. These methods are known for their simplicity, interpretability, and computational efficiency.

On the other hand, deep neural networks are composed of multiple layers of artificial neurons that are trained to learn complex representations of the data. They include convolutional neural networks, recurrent neural networks, generative adversarial networks, and transformers. These models can capture complex patterns in the data, but they are computationally more intensive to train and less interpretable than traditional machine learning methods. Additionally, they require larger amounts of training data to achieve optimal performance.

In this semester's Machine Learning course, we will concentrate exclusively on traditional machine learning techniques. For those seeking further knowledge on deep neural networks, the Neural Networks course offered in the following semester may be of interest.

Some machine learning terms

  • feature: one column of the data table,
  • example: one row of the data table,
  • training: finding the model parameters that fit the data well,
  • prediction: estimating the target value of an input vector, using a trained model,
  • training set: a subset of examples that is used for training,
  • test set: another, distinct set of examples that is used for evaluating the trained model.
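The training/test split from the list above can be sketched in a few lines (a minimal illustration on random data; the 80/20 split ratio is just a common convention, not a rule from the course):

```python
import numpy as np

# Hypothetical data table: 10 examples, 3 features + 1 target column.
rng = np.random.RandomState(0)
data = rng.rand(10, 4)

# Shuffle the rows, then split: the first 8 examples form the
# training set, the remaining 2 the (distinct) test set.
perm = rng.permutation(len(data))
train_set = data[perm[:8]]
test_set = data[perm[8:]]

print(train_set.shape, test_set.shape)
```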

Some Moments from the History of Machine Learning¶


The Perceptron Algorithm: In the 1950s, a psychologist named Frank Rosenblatt proposed the Perceptron, a simple artificial neuron model that could solve linear classification problems. Although it was limited in its ability to solve more complex problems, the Perceptron was the first step towards creating more sophisticated machine learning algorithms.
The Dartmouth Workshop: In 1956, a group of researchers including John McCarthy, Marvin Minsky, Nathaniel Rochester and Claude Shannon gathered at Dartmouth College to discuss the possibilities of thinking machines. This conference is now considered to be the birthplace of the field of artificial intelligence.
The Backpropagation Algorithm: In the 1980s, Geoffrey Hinton and his students developed the backpropagation algorithm, which allowed neural networks to learn from their mistakes and improve over time. This breakthrough revolutionized the field of machine learning and paved the way for deep learning, a type of neural network with many layers.
The MNIST Handwritten Digits Dataset: In the late 1990s, Yann LeCun and colleagues assembled a dataset of handwritten digits, derived from image databases of the National Institute of Standards and Technology, to use as a benchmark for evaluating machine learning algorithms. This dataset, called MNIST, is still widely used today and has played a major role in advancing the field of computer vision.
The Netflix Prize: In 2006, Netflix launched a competition to improve its movie recommendation system. The competition was called the Netflix Prize and offered a million-dollar prize to the team that could produce the best algorithm for predicting which movies a user would like. The winning team, BellKor's Pragmatic Chaos, used a combination of matrix factorization and collaborative filtering to improve the accuracy of Netflix's recommendations.
The Deep Learning Revolution: In 2012, Alex Krizhevsky, Ilya Sutskever and Geoffrey Hinton won the ImageNet image classification challenge by a large margin, using a deep convolutional neural network now known as AlexNet. In the following years, researchers surpassed state-of-the-art results on several benchmark datasets using deep learning, which has since become one of the most active areas of research in machine learning.
AlphaGo: In 2016, a machine learning algorithm developed by Google DeepMind, called AlphaGo, defeated Lee Sedol, a 9-dan professional Go player, in a best-of-five series. This marked a significant milestone in the development of artificial intelligence and showed that machines could outperform humans in complex, strategic games.

The OpenAI GPT Models: In 2018, OpenAI introduced the GPT (Generative Pretrained Transformer) language models, which were trained on a massive amount of text data and could generate human-like text. The GPT models have been widely adopted and have led to advances in areas such as natural language processing and conversational AI.

Univariate Linear Regression¶

Linear regression is a simple but still useful learning algorithm that dates back to the 19th century: the underlying method of least squares was introduced by Legendre and Gauss in the early 1800s, the term "regression" was coined by Galton in the 1880s, and the statistical treatment was later formalized by Fisher in the early 20th century. Despite its age, linear regression remains popular in numerous fields, with its implementations constantly evolving in response to technological advancements.

Exercise 1: Perform univariate linear regression on the MLB baseball player data in the baseball.txt file, using height as the input and weight as the target!

In [1]:
# Load data.
import numpy as np
data = np.loadtxt('../_data/baseball.txt', delimiter=',')
data
Out[1]:
array([[188.,  82.],
       [188.,  98.],
       [183.,  95.],
       ...,
       [190.,  93.],
       [190.,  86.],
       [185.,  88.]])
In [2]:
data.shape
Out[2]:
(1033, 2)
In [3]:
# Extract input vector (height) and target vector (weight).
x = data[:, 0]
y = data[:, 1]
In [4]:
# Subtract mean.
xm = x.mean()
ym = y.mean()
x -= xm
y -= ym
In [5]:
# Prepare a scatter plot of the data.
import matplotlib.pyplot as plt
plt.figure(figsize=(8, 8))
plt.scatter(x, y, alpha=0.7)
Out[5]:
<matplotlib.collections.PathCollection at 0x7f6508d96280>
In [6]:
# Train model (compute the optimal w value).
w = (x @ y) / (x @ x)
w
Out[6]:
0.8561919786085516
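The closed-form expression for $w$ used in the cell above follows from minimizing the sum of squared errors of the mean-centered model $\hat{y} = wx$:

```latex
E(w) = \sum_i (w x_i - y_i)^2,
\qquad
\frac{dE}{dw} = 2 \sum_i x_i (w x_i - y_i) = 0
\;\Longrightarrow\;
w = \frac{\sum_i x_i y_i}{\sum_i x_i^2} = \frac{x^\top y}{x^\top x}.
```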
In [7]:
# Predict the weight of a player who is 190 cm tall.
(190 - xm) * w + ym
Out[7]:
94.01062994703861
In [8]:
# Draw regression line into the scatter plot.
plt.figure(figsize=(8, 8))
plt.scatter(x, y, alpha=0.7)
x2 = np.array([x.min(), x.max()])
plt.plot(x2, x2 * w, color='red')
Out[8]:
[<matplotlib.lines.Line2D at 0x7f6508d5f280>]

Multivariate Linear Regression¶

Multivariate linear regression can be derived from univariate linear regression by adding additional input features to the model. The equation for multivariate linear regression is an extension of the equation for univariate linear regression, with multiple coefficients, each representing the strength and direction of the relationship between one input feature and the target.
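In matrix notation the model is $\hat{y} = Xw$, where $X$ is the $n \times d$ input matrix and $w$ the parameter vector. Minimizing the squared error leads to the normal equations:

```latex
E(w) = \lVert Xw - y \rVert^2,
\qquad
\nabla_w E = 2 X^\top (Xw - y) = 0
\;\Longrightarrow\;
X^\top X \, w = X^\top y
\;\Longrightarrow\;
w = (X^\top X)^{-1} X^\top y.
```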

In [9]:
import numpy as np

def gen_data(n, d):
    rs = np.random.RandomState(42)
    X = rs.random(size=(n, d)) # input matrix
    v = np.arange(d) # v = [0, 1, 2, ...]
    y = X @ v + rs.randn(n)      # target vector
    return X, y
In [10]:
# Generate artificial data set.
X, y = gen_data(1000, 50)
In [11]:
X
Out[11]:
array([[0.37454012, 0.95071431, 0.73199394, ..., 0.52006802, 0.54671028,
        0.18485446],
       [0.96958463, 0.77513282, 0.93949894, ..., 0.42754102, 0.02541913,
        0.10789143],
       [0.03142919, 0.63641041, 0.31435598, ..., 0.50267902, 0.05147875,
        0.27864646],
       ...,
       [0.59548243, 0.53848475, 0.04274819, ..., 0.5590887 , 0.3681002 ,
        0.92417205],
       [0.88564332, 0.6376487 , 0.76785135, ..., 0.24720585, 0.3965254 ,
        0.98608331],
       [0.34848823, 0.70796571, 0.25040201, ..., 0.74899508, 0.52101091,
        0.86170671]])
In [12]:
y
Out[12]:
array([533.60298728, 576.13366431, 568.12812277, 668.4824252 ,
       699.66909766, 589.5267887 , 531.56261332, 683.01173847,
       611.73297041, 670.37772058, 715.47894326, 512.1816071 ,
       548.33170173, 547.54283479, 662.85141039, 674.7433071 ,
       584.58899671, 502.14390141, 597.78030216, 587.18554943,
       610.16095671, 599.58362225, 689.52317345, 667.34908754,
       675.11757122, 489.63025321, 624.32731482, 599.46980079,
       656.0696828 , 616.87250615, 569.10848594, 585.34448774,
       686.32595749, 618.91998319, 543.05711379, 562.62773713,
       632.41948068, 641.94669327, 595.92341887, 569.30015718,
       576.3116163 , 662.12125613, 678.7499848 , 652.48800985,
       605.68772642, 456.88041884, 548.32996584, 605.71916399,
       569.31211126, 693.44582378, 619.44611593, 602.04776211,
       643.69561267, 561.92370543, 619.58205884, 650.20010529,
       621.64986239, 611.40352106, 651.06500718, 654.82993352,
       554.42711188, 614.83569792, 666.79837141, 605.95941933,
       547.06723134, 621.97741424, 640.67422926, 648.20977603,
       716.83188021, 651.91076231, 500.57651324, 640.58263339,
       590.58867185, 568.9109459 , 619.29435075, 574.05327058,
       605.73577567, 695.68606536, 537.35173283, 538.25900935,
       652.74850841, 532.19107575, 618.53131718, 607.25191034,
       623.66104662, 568.53264643, 656.79227203, 473.62135611,
       614.91356804, 605.70400499, 561.20970362, 642.99802025,
       696.33576079, 642.46787392, 517.61578357, 524.48600805,
       554.09076698, 504.63102695, 751.71289379, 642.30392682,
       653.14583051, 472.67124327, 657.32912328, 642.01537252,
       639.0730697 , 518.22110336, 585.76193653, 639.40326235,
       542.53993253, 665.17391855, 521.8277348 , 674.65774615,
       547.81463406, 613.37979208, 677.80999852, 668.40823459,
       591.20879292, 666.5390364 , 600.07729159, 603.61615141,
       623.09928054, 557.87895829, 611.51508082, 533.12080234,
       663.66743294, 661.22474711, 571.87826526, 657.56664128,
       604.71163486, 586.95552031, 612.11574268, 589.93409388,
       693.13586421, 594.011437  , 621.13502496, 639.67134288,
       593.32038589, 615.35191525, 574.94837843, 630.45834706,
       606.22652073, 567.97662018, 656.16447389, 683.91444323,
       636.82740691, 546.53414434, 615.66589955, 566.15028732,
       471.8834159 , 598.04627214, 583.03172294, 600.57121455,
       569.01119203, 659.50105997, 579.80222348, 553.34736908,
       575.22807612, 587.82569585, 598.24880681, 586.01660687,
       597.5625893 , 636.10118627, 592.5652126 , 499.17237395,
       600.7445794 , 606.46160281, 689.31079555, 655.5960607 ,
       524.48498942, 629.13670655, 650.13830504, 649.53546881,
       577.90265774, 518.61555531, 497.0424331 , 551.33073777,
       603.62498008, 551.73687332, 659.64630027, 597.91619141,
       578.51980672, 603.47164092, 690.03075666, 671.86238948,
       625.78996809, 661.8602284 , 729.8911097 , 572.71491354,
       554.41724055, 597.3477592 , 566.28212371, 521.36150236,
       547.10801184, 616.71925385, 578.4330811 , 635.18867233,
       689.55171783, 602.05577751, 548.18032969, 650.13863673,
       631.01827903, 604.80695929, 608.69866534, 588.49399087,
       621.87653762, 598.02005307, 610.12640543, 729.56443758,
       676.07104522, 503.42067556, 655.49705406, 682.92733235,
       727.22528663, 615.11945768, 582.02104882, 553.83331239,
       678.69101371, 502.678386  , 578.79048798, 676.21132215,
       661.11509085, 667.26781244, 698.36804444, 668.78721438,
       604.17754807, 540.4209061 , 616.69821954, 648.16955813,
       661.66103627, 603.96221627, 696.89319694, 631.71336684,
       586.30145422, 602.16015884, 510.73287034, 595.42278933,
       665.34861272, 543.01226639, 581.78384363, 573.34674288,
       654.69671441, 600.88320224, 570.38517583, 608.55258658,
       557.4073946 , 594.40440783, 595.78296903, 646.71636438,
       623.53197391, 568.44820564, 538.95665799, 621.72638544,
       642.60249889, 646.1445285 , 592.0743679 , 669.9656152 ,
       548.39466463, 631.92584084, 580.67809121, 575.80529514,
       644.87415208, 593.90147586, 607.63139237, 606.58725191,
       616.06135945, 575.00388626, 587.65064288, 500.36799009,
       699.20015182, 593.03502784, 624.29806561, 560.34691305,
       654.37674598, 609.87337569, 642.15627111, 722.04916187,
       531.16747377, 653.05954546, 573.65450751, 570.12047677,
       475.08051767, 705.69942419, 538.03018984, 570.75209522,
       641.1960784 , 610.44192587, 504.66054411, 679.0884266 ,
       672.84248442, 569.7572077 , 609.42141973, 722.5083582 ,
       709.49934623, 572.00105454, 635.08571362, 676.52917817,
       600.89729802, 549.34067907, 572.12848879, 643.7433099 ,
       607.23137432, 492.68094828, 675.06104281, 529.30200858,
       611.15121975, 739.02445615, 582.76424946, 660.38093602,
       587.62435075, 612.58744873, 607.65066802, 556.48612853,
       671.72899473, 639.11993107, 582.1178745 , 551.20980852,
       614.19399348, 531.73736577, 540.40089093, 736.66257032,
       679.48118563, 672.22324735, 684.19967443, 696.21307649,
       591.04139721, 623.01695893, 620.26676803, 533.03247725,
       680.31320793, 505.59565035, 629.66793349, 647.38144826,
       614.97689596, 618.89638808, 684.44546763, 614.68282113,
       643.55409054, 640.80228437, 538.100829  , 622.77267511,
       626.77515112, 670.39881215, 696.0788142 , 550.0778602 ,
       665.57932891, 668.86078953, 585.41222607, 711.82624601,
       546.9341624 , 634.62922714, 786.40233277, 607.07740213,
       639.62142236, 651.94818927, 554.2227789 , 570.41734727,
       558.2027597 , 619.00725771, 639.51684098, 636.30851736,
       739.1201776 , 620.27969997, 650.34883179, 595.54195358,
       479.3673627 , 631.82007192, 687.43099107, 598.72888314,
       601.10095784, 629.88351296, 619.20240238, 577.40078056,
       703.57668227, 665.78740222, 648.17914085, 774.92275966,
       781.90921519, 563.43925188, 643.39388458, 566.09404412,
       519.79678831, 594.74727539, 643.56960478, 664.18397602,
       690.8750879 , 731.99324603, 715.5187224 , 705.52520146,
       559.3362851 , 616.74816172, 667.00177421, 635.21573188,
       590.92167427, 578.07548852, 652.84903089, 592.99599913,
       615.22138438, 611.94903049, 632.78555043, 636.7236512 ,
       542.52025178, 632.73032389, 607.82876332, 611.04445021,
       655.13553522, 660.22297873, 638.31024622, 579.66858404,
       571.7927038 , 570.29038681, 651.72951633, 653.72358497,
       596.68962796, 645.9541516 , 648.4485629 , 671.98789041,
       687.22543313, 625.29068989, 599.56980914, 628.30832786,
       649.61449114, 566.60937051, 613.13883947, 562.37835437,
       587.51125863, 577.18067916, 681.32398531, 613.132692  ,
       591.05125516, 656.22650697, 644.75917497, 652.33058262,
       632.23440858, 650.04465811, 572.27177796, 698.54026897,
       616.33162248, 608.16870789, 644.52815974, 685.5823023 ,
       649.51158083, 659.0006635 , 645.5499349 , 541.42496491,
       672.60322035, 688.09550422, 642.61070693, 617.30952031,
       643.72646378, 787.79918576, 719.78186545, 656.30187592,
       571.51880472, 722.15541177, 513.45329347, 674.07730907,
       651.02504079, 666.44537741, 651.36917687, 591.37113418,
       607.01875954, 604.78906127, 645.36512819, 547.36342469,
       603.01378175, 589.32265454, 643.34615534, 612.72849284,
       679.30796188, 680.26642967, 677.78581026, 596.88495621,
       606.85436245, 650.4596151 , 624.15318038, 606.45678827,
       659.04682675, 649.33629557, 567.9283756 , 571.59229027,
       550.30252869, 553.11815621, 559.25987661, 694.3622247 ,
       632.00071189, 573.88304869, 634.00871259, 646.08180541,
       493.91106496, 641.43733246, 641.42231373, 607.7126624 ,
       531.51156738, 634.83672642, 616.59358544, 636.95107853,
       571.86548414, 509.36017835, 609.89304418, 646.78063686,
       581.19459218, 717.64703024, 579.76687635, 684.26623553,
       586.09791714, 531.28934636, 562.99109934, 697.61053687,
       586.79906326, 587.41256026, 612.73090422, 569.29562228,
       622.56436724, 599.25282599, 541.8207659 , 596.47404425,
       625.88948954, 596.49949992, 661.85854444, 719.5040606 ,
       512.82627844, 661.14458522, 493.08375925, 664.98330925,
       636.05695839, 624.11639716, 607.74566853, 602.0158472 ,
       644.55849141, 668.3162859 , 648.2019683 , 637.12463597,
       630.88928865, 520.92420437, 687.93503033, 549.39997795,
       585.01737779, 544.40292729, 505.6162989 , 610.63377011,
       644.70315104, 636.19666595, 740.93901813, 659.87991229,
       542.12112861, 628.23336427, 613.04608524, 602.75376798,
       621.3284228 , 647.42401243, 609.77404678, 696.82486181,
       514.54099296, 543.76223298, 554.65047141, 680.0459826 ,
       597.5740599 , 573.14884552, 610.07792897, 603.63099765,
       632.47344205, 553.69058746, 549.45207478, 613.52747865,
       545.79365067, 531.04345106, 594.89409584, 537.62980135,
       710.89463077, 536.96741227, 598.85231598, 657.84199334,
       658.67049632, 608.70002868, 526.38117565, 638.30729035,
       606.07245338, 682.04643892, 571.30179895, 604.51706603,
       571.24754468, 618.08891457, 718.21515755, 531.03097587,
       532.79068639, 545.82327881, 552.42521371, 577.43885495,
       576.83762354, 548.25556041, 672.36176898, 606.38250352,
       628.98545182, 706.26599122, 640.61681382, 700.72741296,
       491.56243555, 626.97576169, 676.20785856, 523.10264845,
       598.63204107, 679.01271397, 606.07482248, 667.17491807,
       644.27749008, 636.82573797, 620.72854646, 714.15523499,
       595.57564896, 605.96166716, 600.75723233, 681.96887017,
       554.26286299, 588.51394391, 580.63306922, 598.7405639 ,
       621.25137797, 640.84797153, 617.08749681, 672.19518372,
       637.68485235, 632.62071636, 703.64175583, 660.67474679,
       495.44905107, 604.09504246, 530.33149763, 693.86774842,
       513.7268996 , 601.44990932, 637.57628935, 758.26773573,
       621.43548783, 572.34241919, 652.88992303, 627.56135417,
       623.11572252, 606.35813937, 625.49863576, 702.41088271,
       654.35028581, 636.8930066 , 551.30788687, 597.52595054,
       557.45785057, 567.35624984, 677.32313942, 615.46581558,
       655.2705666 , 543.61512094, 594.32745871, 613.82706026,
       587.85567549, 629.4350337 , 595.65304923, 711.46127586,
       563.72590498, 613.45273989, 702.70130692, 599.48192021,
       659.33399463, 517.71079822, 629.30287401, 575.22698328,
       586.63317443, 679.86964867, 592.11086791, 635.34191572,
       740.1664258 , 473.71057609, 592.12749434, 582.44219562,
       581.6568571 , 659.31551499, 552.29400302, 515.77364034,
       536.05460574, 580.52952892, 622.70396584, 666.57656233,
       615.84988701, 688.21947633, 581.70544319, 581.84270347,
       592.62519429, 595.22889327, 591.41319213, 524.97864302,
       644.37658007, 584.79024451, 620.54656898, 676.24842862,
       702.3441974 , 510.96556776, 599.67546431, 501.51660459,
       719.86353412, 636.05922026, 711.62624341, 649.48967623,
       560.53318347, 588.16044366, 602.42413431, 603.97382993,
       621.83560155, 609.92379915, 471.63047473, 685.02997902,
       701.43015855, 592.57834997, 581.75065625, 621.61284929,
       555.12830835, 602.74036201, 584.66291011, 617.13682821,
       583.65566214, 635.38532553, 493.83781479, 674.34166563,
       592.14171789, 541.33812952, 620.11477598, 664.67664224,
       639.69045432, 587.69931575, 540.96620902, 592.59964586,
       565.94241844, 621.57501691, 549.30971686, 624.79228204,
       597.43503521, 506.26315454, 643.5618499 , 514.56704925,
       640.40201105, 596.73176285, 581.73976091, 603.56661306,
       641.81984096, 578.56346326, 707.36415539, 536.61590984,
       731.51717985, 634.98465732, 700.2702999 , 597.53790895,
       542.70601906, 714.20733167, 548.56526488, 697.90057861,
       595.34924103, 739.78404843, 656.12078043, 580.38821765,
       555.628187  , 474.84929602, 619.24834379, 592.83876597,
       656.54278653, 693.06845681, 679.48443453, 614.85358042,
       614.09890856, 659.74841638, 670.21572253, 496.11644326,
       594.72379753, 765.13774235, 599.93957237, 521.69606985,
       591.93319402, 624.98486045, 623.80916963, 660.66725011,
       544.59609113, 591.4370003 , 584.59328677, 696.33135916,
       707.61728437, 590.27637688, 596.55252407, 591.45481015,
       548.4701266 , 633.63595543, 649.70377691, 673.72902008,
       552.92932245, 606.40282452, 584.53434685, 752.62566101,
       571.75329378, 592.65535747, 614.80788995, 533.38738288,
       579.9495681 , 590.86712324, 626.63388383, 594.02701696,
       631.63489784, 656.78677556, 744.21891956, 587.69007297,
       657.45591031, 573.31804146, 686.250548  , 601.57952234,
       625.18003127, 721.84826587, 538.8234735 , 505.01823449,
       533.52939041, 580.37144198, 496.44100455, 599.84300393,
       575.36883759, 674.98528742, 625.42667175, 572.20235894,
       735.79558501, 602.22759331, 599.5237387 , 536.79394786,
       642.17814186, 648.42938982, 734.23976115, 451.83426741,
       571.46926646, 605.04313018, 598.6540863 , 659.95044659,
       528.50897078, 601.25171419, 552.25815484, 570.92400029,
       792.59658143, 613.13959698, 608.98998135, 651.79474898,
       618.99072373, 746.8657187 , 568.92827496, 647.31331895,
       627.17368929, 662.46131352, 625.36223152, 575.26758442,
       543.96515094, 599.53071918, 675.46697587, 749.05094747,
       696.84920026, 440.53615803, 528.86401269, 692.35398861,
       673.21872973, 593.5864642 , 595.78774755, 528.84468049,
       559.70091186, 625.87771933, 579.69442979, 674.43979652,
       572.17237554, 603.39481608, 550.75156319, 563.86864191,
       704.06112426, 555.08305997, 587.77069532, 562.36272498,
       583.63720165, 654.48570917, 614.36957386, 595.14021805,
       625.638566  , 629.92067973, 597.72558798, 581.25913459,
       512.77123543, 676.31559803, 659.44825695, 541.0299485 ,
       598.82381955, 614.90290364, 519.49045193, 589.90374501,
       559.49840085, 551.12579797, 563.86969334, 570.65384216,
       662.98125595, 735.27808143, 561.54839388, 604.47001057,
       580.28405674, 705.27980347, 652.16804913, 590.57759793,
       699.66325413, 617.4386229 , 647.89067269, 685.87420105,
       595.72940081, 706.55203936, 641.52383763, 595.39275637,
       523.89905244, 724.36611195, 518.84495704, 709.83691295,
       599.14970439, 642.88188836, 589.25484469, 634.19214729,
       666.08314297, 496.15239549, 548.90984186, 676.12896525,
       568.92809567, 582.84250843, 586.55623748, 699.20128293,
       593.81007788, 629.67580693, 600.0248951 , 688.27003095,
       631.88812573, 656.75764748, 627.65260717, 631.26607072,
       577.89363379, 624.12716934, 592.0800309 , 609.25219121,
       629.1641768 , 634.20099076, 587.22353007, 633.47057924,
       582.98909836, 593.24733883, 549.49017531, 637.8482583 ,
       619.09142305, 679.87318876, 663.9418921 , 579.19992964,
       667.38698284, 435.55521922, 651.43760131, 625.1144892 ,
       554.10561686, 602.23901348, 662.84512438, 688.61348317,
       711.49936184, 672.74983533, 528.77517752, 517.69225477,
       681.37258107, 612.7823142 , 464.38642279, 596.08251274,
       568.30684831, 560.22974029, 575.86177965, 549.33001301,
       682.19604038, 575.09686166, 625.40196453, 543.05207989,
       687.87796496, 658.49866923, 604.99315496, 615.93869084,
       578.37359144, 525.08622066, 607.38720982, 582.38630657,
       631.67616243, 557.72703224, 626.53495027, 562.70024147,
       679.14318766, 528.0827956 , 683.892276  , 660.53735251,
       588.04826164, 698.04267716, 558.80716068, 617.73761709,
       583.43653814, 579.46156286, 502.99029364, 701.87953073,
       627.69084159, 530.99909756, 597.38904177, 618.10516116])
In [13]:
X.shape
Out[13]:
(1000, 50)
In [14]:
%%time
# Compute optimal parameter vector.
w = np.linalg.inv(X.T @ X) @ (X.T @ y)
w
CPU times: user 5.76 ms, sys: 336 µs, total: 6.09 ms
Wall time: 5.65 ms
Out[14]:
array([-6.27668535e-03,  7.79377042e-01,  2.12007621e+00,  2.91175236e+00,
        3.87036693e+00,  5.10677902e+00,  5.99889354e+00,  6.96317023e+00,
        7.85544458e+00,  9.14518616e+00,  9.91401623e+00,  1.10795175e+01,
        1.18655076e+01,  1.29230253e+01,  1.39493048e+01,  1.50584688e+01,
        1.59630538e+01,  1.70070564e+01,  1.78704537e+01,  1.91984375e+01,
        1.98991713e+01,  2.09711802e+01,  2.20835181e+01,  2.31399674e+01,
        2.40855004e+01,  2.49690066e+01,  2.61665348e+01,  2.69094644e+01,
        2.80304559e+01,  2.89075191e+01,  3.00075384e+01,  3.10977986e+01,
        3.18331446e+01,  3.32259644e+01,  3.40089749e+01,  3.53181296e+01,
        3.59508608e+01,  3.69689455e+01,  3.76996546e+01,  3.89252973e+01,
        4.02508652e+01,  4.11066389e+01,  4.19419945e+01,  4.29471376e+01,
        4.39179113e+01,  4.51380313e+01,  4.60397828e+01,  4.69299566e+01,
        4.78870045e+01,  4.91419825e+01])
In [15]:
%%time
# ...more efficient way
np.linalg.solve(X.T @ X, X.T @ y)
CPU times: user 2.96 ms, sys: 657 µs, total: 3.62 ms
Wall time: 3.01 ms
Out[15]:
array([-6.27668535e-03,  7.79377042e-01,  2.12007621e+00,  2.91175236e+00,
        3.87036693e+00,  5.10677902e+00,  5.99889354e+00,  6.96317023e+00,
        7.85544458e+00,  9.14518616e+00,  9.91401623e+00,  1.10795175e+01,
        1.18655076e+01,  1.29230253e+01,  1.39493048e+01,  1.50584688e+01,
        1.59630538e+01,  1.70070564e+01,  1.78704537e+01,  1.91984375e+01,
        1.98991713e+01,  2.09711802e+01,  2.20835181e+01,  2.31399674e+01,
        2.40855004e+01,  2.49690066e+01,  2.61665348e+01,  2.69094644e+01,
        2.80304559e+01,  2.89075191e+01,  3.00075384e+01,  3.10977986e+01,
        3.18331446e+01,  3.32259644e+01,  3.40089749e+01,  3.53181296e+01,
        3.59508608e+01,  3.69689455e+01,  3.76996546e+01,  3.89252973e+01,
        4.02508652e+01,  4.11066389e+01,  4.19419945e+01,  4.29471376e+01,
        4.39179113e+01,  4.51380313e+01,  4.60397828e+01,  4.69299566e+01,
        4.78870045e+01,  4.91419825e+01])
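Forming X.T @ X squares the condition number of the problem, so on ill-conditioned data both cells above can lose precision. A common, more numerically robust alternative (a sketch, not part of the course material) is np.linalg.lstsq, which solves the least-squares problem directly via the SVD; on well-conditioned data like this, its result agrees with the normal-equation solution:

```python
import numpy as np

# Re-create the artificial data set (same recipe as gen_data above).
rng = np.random.RandomState(42)
X = rng.random(size=(1000, 50))
v = np.arange(50)
y = X @ v + rng.randn(1000)

# Normal equations, as in the cells above.
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Direct least-squares solve; avoids forming X.T @ X entirely.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w_normal, w_lstsq))  # expected: True on this data
```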

Exercise 3: Plot the time complexity of training the model!

In [16]:
def fit_lin_reg(X, y):
    return np.linalg.solve(X.T @ X, X.T @ y)
In [17]:
import time
import pandas as pd

# How does the running time grow with n?
data = []
for n in [5000, 10000, 15000, 20000, 25000, 30000, 35000]:
    print(n)
    X, y = gen_data(n, 50)
    t0 = time.time()
    fit_lin_reg(X, y)
    t1 = time.time()
    data.append({
        'n': n,
        'time': t1 - t0
    })
    
pd.DataFrame(data).set_index('n').plot()
5000
10000
15000
20000
25000
30000
35000
Out[17]:
<AxesSubplot: xlabel='n'>
In [18]:
# How does the running time grow with d?
data = []
for d in [50, 100, 200, 500, 1000, 1500, 2000]:
    print(d)
    X, y = gen_data(20000, d)
    t0 = time.time()
    fit_lin_reg(X, y)
    t1 = time.time()
    data.append({
        'd': d,
        'time': t1 - t0
    })
    
pd.DataFrame(data).set_index('d').plot()
50
100
200
500
1000
1500
2000
Out[18]:
<AxesSubplot: xlabel='d'>

Handling the Bias Term¶

To include a bias term in $d$-variate linear regression, one can append an all-ones column to the end of the input matrix, making it $X'=[X\ |\ \mathbf{1}]$, and append a bias parameter to the end of the parameter vector, $w'=[w\ |\ \beta]$. The resulting ($d+1$)-variate model is equivalent to the $d$-variate model with a bias term, because $X'w' = Xw + \beta\mathbf{1}$.

In [19]:
# Incorporate a bias term into the previous model.
X, y = gen_data(1000, 50)
o = np.ones((1000, 1))
X_pr = np.hstack([X, o])
w_pr = fit_lin_reg(X_pr, y)
w, beta = w_pr[:50], w_pr[50]
In [20]:
w
Out[20]:
array([-3.06401464e-03,  7.82222930e-01,  2.12108232e+00,  2.91386326e+00,
        3.87228420e+00,  5.11026964e+00,  6.00067559e+00,  6.96542866e+00,
        7.85673963e+00,  9.14797989e+00,  9.91686530e+00,  1.10816627e+01,
        1.18689718e+01,  1.29260957e+01,  1.39518797e+01,  1.50615225e+01,
        1.59648657e+01,  1.70097186e+01,  1.78728642e+01,  1.92008891e+01,
        1.99009444e+01,  2.09740852e+01,  2.20863362e+01,  2.31414739e+01,
        2.40883621e+01,  2.49715111e+01,  2.61689865e+01,  2.69129379e+01,
        2.80332758e+01,  2.89101312e+01,  3.00091476e+01,  3.11004170e+01,
        3.18356050e+01,  3.32282182e+01,  3.40110034e+01,  3.53203475e+01,
        3.59551113e+01,  3.69712921e+01,  3.77020377e+01,  3.89276599e+01,
        4.02537764e+01,  4.11091654e+01,  4.19436299e+01,  4.29507093e+01,
        4.39199257e+01,  4.51407203e+01,  4.60427084e+01,  4.69329446e+01,
        4.78878303e+01,  4.91438904e+01])
In [21]:
beta # bias (or intercept) term
Out[21]:
-0.06197959478839162

Linear Regression on Sparse Data¶

When the input matrix is sparse, meaning most of its values are zero, storing it densely wastes both memory and computation. To avoid these inefficiencies, the input matrix should be represented in a sparse matrix format such as COO, CSR, or CSC, and training algorithms should be used that operate only on the non-zero entries of the input matrix.

In [2]:
import numpy as np
import scipy.sparse as sp

def gen_sparse_data(n, d, density):
    X = sp.random(n, d, density=density, random_state=42) # input matrix
    v = np.arange(d)
    y = X @ v + np.random.RandomState(42).randn(n)   # target vector
    return X, y
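One such non-zero-aware training algorithm is the iterative LSQR solver from scipy.sparse.linalg, which minimizes $\lVert Xw - y \rVert_2$ using only sparse matrix-vector products. A sketch (the data is re-created with the same recipe as gen_sparse_data so the block is self-contained):

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Re-create a small sparse data set (same recipe as gen_sparse_data).
X = sp.random(1000, 100, density=0.01, random_state=42)
v = np.arange(100)
y = X @ v + np.random.RandomState(42).randn(1000)

# LSQR never densifies X; each iteration only touches the
# stored non-zero entries.
w = spla.lsqr(X, y, atol=1e-10, btol=1e-10)[0]

print(w.shape)  # (100,)
```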
In [3]:
# Generate artificial data set.
X, y = gen_sparse_data(1000, 100, 0.01)
In [4]:
X
Out[4]:
<1000x100 sparse matrix of type '<class 'numpy.float64'>'
	with 1000 stored elements in COOrdinate format>
In [5]:
y
Out[5]:
array([ 8.31573963e+01, -1.38264301e-01,  6.47688538e-01,  6.95470361e+01,
        8.64440171e+01,  2.21265946e+01,  3.92455292e+01,  7.67434729e-01,
        4.22673463e+00,  5.42560044e-01,  2.18408047e+01,  1.10375192e+01,
        1.03918651e+01,  5.47211613e+01, -1.66547943e+00,  1.97022890e+01,
       -1.01283112e+00,  2.80787047e+01,  6.96266404e-01,  2.23033667e+01,
        1.46564877e+00,  2.30734549e+01,  3.49393826e+01, -1.42474819e+00,
       -5.44382725e-01,  1.10922590e-01,  4.16321594e+01,  3.75698018e-01,
        5.49280248e+00,  3.97282594e+01, -6.01706612e-01,  1.85227818e+00,
       -1.34972247e-02, -1.05771093e+00,  7.60710146e+01,  4.87893739e+00,
        2.08863595e-01,  3.83735573e+01,  8.49340845e+01,  7.63919181e+01,
        9.46626266e+00,  1.71368281e-01,  5.58220928e+01,  6.86964546e+01,
       -1.47852199e+00, -5.28017778e-01,  9.83779634e+00,  1.05712223e+00,
        3.43618290e-01,  2.54551279e-01,  5.83283462e+00, -3.85082280e-01,
        4.78181839e+00,  6.11676289e-01,  2.07980776e+01,  5.18802057e+01,
        4.73186585e+01,  1.97540367e-01,  3.31263431e-01,  9.75545127e-01,
        1.03568355e+01, -1.85658977e-01, -1.10633497e+00, -1.19620662e+00,
        4.12787590e+01,  1.35624003e+00,  5.05158682e+01,  5.77108280e+01,
        3.02452754e+01, -6.45119755e-01,  1.02726922e+02,  5.49761660e+01,
       -3.58260391e-02,  6.88264052e+00,  2.03846593e+01,  1.18143324e+01,
        2.76285836e+01, -2.99007350e-01,  3.97305882e+01, -1.98756891e+00,
        2.58469048e+01,  3.57112572e-01,  1.47789404e+00,  1.11348614e+02,
        3.13910873e+01,  2.71888802e+00,  6.79054939e+00,  6.77364818e+01,
       -5.29760204e-01,  8.23053246e+00,  9.70775493e-02,  5.13648346e+01,
        2.60305176e+00,  2.45473793e+01, -3.92108153e-01, -1.46351495e+00,
        3.99643569e+00,  5.10159902e+01,  5.11345664e-03,  5.87251946e+01,
        5.16514622e+01, -4.20645323e-01,  2.53658381e+01, -8.02277269e-01,
        1.28791475e+00,  4.04050857e-01,  6.76665596e+01,  1.74577813e-01,
        2.98084866e+01, -7.44459158e-02,  6.79093020e+00,  4.29789834e+01,
        6.02302099e-02,  2.04598018e+01, -1.92360965e-01,  1.44403438e+01,
        1.95949131e+01, -1.16867804e+00,  1.14282281e+00,  3.61953477e+01,
        3.43137744e+01,  1.88298575e+01,  2.48449536e+01, -1.40185106e+00,
        5.86857094e-01,  5.29688872e+01,  1.26765140e+02,  9.75571757e+01,
        1.14887010e+01,  2.76550142e+01,  1.43791628e+02,  6.85629748e-02,
       -1.06230371e+00,  4.73592431e-01,  4.55548677e+00,  1.54993441e+00,
       -7.83253292e-01,  4.49772935e+01,  8.13517217e-01,  2.02839835e+01,
        2.27459935e-01,  4.29545612e+01,  5.82367303e+01,  4.33288648e+01,
        4.60816587e-01,  2.89864031e+01, -1.23695071e+00,  6.54689714e+00,
        4.10995885e+01,  8.84171406e+01,  4.04387533e+01,  4.99520453e+01,
       -6.06055809e-02,  6.79186970e+00,  2.93072473e-01,  1.86771497e+01,
        3.02166893e+01,  4.73832921e-01, -1.19130350e+00,  6.56553609e-01,
        9.45499187e+00,  7.46181396e+01,  6.58723853e+00,  1.42093651e+01,
        9.63376129e-01,  4.12780927e-01,  7.07775323e+00,  1.89679298e+00,
        8.41290791e+01,  6.24849649e+01,  1.82818663e+01, -8.15810285e-01,
        3.27215858e+01,  1.22751011e+01,  5.44572489e+01,  3.21836378e+00,
        1.19929473e+01,  2.77868493e+01,  5.80690546e+01,  2.72016917e+00,
        1.38297775e+00,  9.72690260e+00,  9.80141738e+01,  1.87712364e+01,
        1.01685084e+02,  7.14000494e-01,  1.11945097e+00,  4.21273124e+01,
       -8.46793718e-01, -1.51484722e+00, -4.46514952e-01,  8.56398794e-01,
        5.40698615e+01, -1.24573878e+00,  3.63481379e+01,  7.25758028e+01,
        1.96335659e+01,  1.53725106e-01,  1.00433263e+02,  1.43237010e+02,
        4.73724577e+00,  3.73670227e+01,  1.08305124e+00,  6.73549627e+01,
       -1.37766937e+00, -9.37825040e-01,  4.77105981e+01,  5.44979552e+01,
        5.15047686e-01,  3.85273149e+00,  5.70890511e-01,  1.13556564e+00,
        7.73548571e+01,  6.51391251e-01, -3.15269245e-01,  1.42822049e+00,
       -7.72825215e-01,  1.59616992e+01, -4.85363548e-01,  8.18741394e-02,
        1.44402904e+01,  3.86292081e+00,  1.00266907e+02,  3.76849022e+01,
       -4.71931866e-01,  1.08895060e+00,  2.16212000e+01, -1.07774478e+00,
        2.06898431e+01,  6.79597749e-01, -3.54486453e-01,  2.44512499e+01,
        1.36196077e+00,  8.32323471e+01,  2.93072279e+01,  7.33944448e+01,
       -2.02514259e+00,  2.25333934e+01,  1.66713020e+01,  8.52433335e-01,
       -7.92520738e-01,  5.61620463e+01,  8.90243046e+00,  8.65755194e-01,
        1.29635554e+01,  7.68271729e+00,  4.27688855e+01,  1.28262901e+01,
        1.76545424e+00,  4.04981711e-01,  4.69833104e+00,  9.17861947e-01,
        2.12215620e+00,  8.60575729e+01, -1.51936997e+00, -4.84234073e-01,
        8.22274809e+00,  5.55100951e+00,  4.43819428e-01,  7.74634053e-01,
        1.06369479e+02,  2.19213539e+01, -3.24126734e+00,  6.64636958e+00,
       -2.52568151e-01, -1.24778318e+00,  2.82057736e+01,  4.73247527e-01,
        1.00599230e+02,  4.93961865e+01,  1.44127329e+00,  1.40189463e+02,
        3.93591458e+01,  2.56733063e+01, -9.81508651e-01,  4.62103474e-01,
        1.76568675e+01,  5.40405966e+01,  6.98020850e-02, -3.85313597e-01,
        4.96773188e-01,  6.62130675e-01,  2.61945903e+01, -1.23781550e+00,
        3.30471221e+01,  4.78361330e+00,  7.61129175e+00,  1.05746970e+00,
        2.80991868e-01, -6.22699520e-01,  8.79536879e+01, -4.93000935e-01,
        2.90190986e+01,  2.76585711e+01,  9.35892809e-01, -6.92909595e-01,
        3.17110434e+01,  1.09949986e+01,  2.34156377e+01,  1.48480502e+01,
       -8.28995011e-01,  1.18766365e+01,  7.47293605e-01,  2.43135695e+01,
        1.26711626e+02,  1.17327383e-01,  1.27766490e+00, -5.91571389e-01,
        5.47097381e-01,  5.67527170e+01, -2.17681203e-01,  6.57839713e+00,
        8.25416349e-01,  8.13509636e-01,  5.61931174e+01,  1.75633693e-01,
        6.64698341e+01,  1.54212157e+00,  3.44555168e+01, -1.30143054e-01,
        9.69959650e-02,  5.95157025e-01,  7.66131428e+01,  8.21353055e+01,
        1.15169397e-01,  3.66658975e+01,  1.15811087e+00,  7.91662694e-01,
        6.24119817e-01,  6.28345509e-01,  2.02502433e+01,  5.39175357e+01,
        3.82728983e+01,  6.23718151e+01,  1.85787209e+01,  6.71337391e+00,
        4.04457891e+01,  4.52657584e+01,  2.06144367e+01, -5.63724553e-01,
        1.99416430e+02,  2.43687211e-01,  8.98584268e+01,  7.51633374e+00,
        1.19782087e+01,  1.24633695e+02,  7.02153059e+01, -1.40746377e+00,
        5.79974267e+01,  2.65373568e+01,  2.66271049e+01,  8.85130758e+01,
        8.57659623e-01, -1.59938530e-01, -1.90162079e-02,  4.46480234e+00,
        4.34368959e+00,  2.06567791e+01,  3.59505607e+01,  2.28317267e+01,
        7.22629805e+01,  1.36033719e+01, -1.08760148e-01,  7.68308484e+01,
        3.07395464e+00,  6.64547789e+01,  2.24092482e-01,  2.95300730e+00,
        4.82770598e+01, -7.73009784e-01,  4.15603092e+01,  4.97998291e-01,
        9.33562226e+01,  2.19014259e+00,  3.78220633e+01,  1.25486238e+02,
        4.73801860e+01,  1.83342006e-01,  3.80737846e+01, -8.08298285e-01,
        1.39005726e+02,  6.08822885e+01, -2.12389572e+00,  3.88696494e+01,
       -7.59132662e-01,  1.50393786e-01,  1.35825652e+01,  3.61440335e+00,
        9.50423838e-01, -5.76903656e-01, -8.98414671e-01,  4.91919172e-01,
        3.02120277e+01,  2.67616669e+00,  2.42169002e+01, -4.69175652e-01,
        8.24297616e+01,  1.35387237e+00, -1.14539845e-01,  1.08188426e+02,
       -1.49572208e+00, -5.99375023e-01,  7.00940125e+01,  2.22740112e+01,
        1.27243791e+02,  2.55045644e+00, -1.06762043e+00,  6.40684534e+00,
        1.20295632e-01,  2.72332007e+01,  7.11614878e-01,  5.89679716e+01,
        5.97441214e+01,  1.61640335e+02,  1.39718940e+01, -7.48486537e-01,
        1.55115198e+00,  3.34144427e+01,  5.16432718e+01,  6.65966229e+01,
        2.06074792e+00,  1.75534084e+00,  9.20578063e+01,  9.85616673e+00,
        1.44463077e+01,  1.36863156e+00, -9.64923461e-01,  2.05624597e+01,
        1.05842449e+00,  3.58637432e+01, -6.37989214e-01,  6.54490662e+01,
        1.83611619e+01,  7.17542256e-01,  1.41255951e+02,  7.40947804e-02,
        3.05577524e+01, -1.38010146e+00,  3.42395373e+01,  1.18725059e+01,
        3.84065449e-01, -3.26947481e-02, -2.06744210e+00,  1.58587098e+02,
        4.66790464e+01,  6.69672549e-01,  1.26456017e+00,  3.40282466e+00,
       -5.75542776e-02,  4.11257648e+01, -6.26790973e-02,  2.77832205e+01,
        7.43712828e+01,  5.04046516e-01, -5.30257618e-01, -7.92872832e-01,
        1.03721086e+01,  4.98694723e+01,  7.99487001e+00,  1.62559326e+01,
        7.63678036e+01,  3.76920016e+01, -6.99725508e-01,  1.31942993e+01,
        5.14391877e+01,  7.94204690e+01,  7.17136927e+00,  7.51908861e+01,
        3.72234846e+01,  5.65192141e+00, -2.75051697e-01, -2.30192116e+00,
       -1.51519106e+00,  1.36687427e+00,  1.64496771e+00, -2.49036040e-01,
        5.32414315e+01,  4.30709040e+01,  3.07888081e+00,  1.11957491e+00,
       -1.27917591e-01,  1.52634570e+01, -1.60644632e+00,  5.57105335e+01,
        1.16309848e+02,  6.09417165e+01, -6.46572884e-01,  3.21627539e+01,
        1.68714164e+00,  8.81639757e-01,  7.46772345e+01,  1.47994414e+00,
        7.73683076e-02,  8.13722505e+01,  1.59644935e+01,  6.79331170e+00,
        4.87763742e+01, -1.90338678e-01,  2.49426678e+01,  7.03843620e+00,
        9.26177548e-01,  3.17672255e+00,  1.16686054e+02,  5.62969237e-01,
       -6.50642569e-01,  1.07150542e+02,  8.39534325e+01, -8.63990770e-01,
        4.85216279e-02,  4.37976798e+01,  2.95769697e+00, -5.02381094e-02,
        7.50823129e+00, -9.07563662e-01,  1.87978653e+01,  2.90833958e+01,
        5.00917188e-01,  8.11837691e+01,  9.93323054e-02,  7.51387123e-01,
       -1.66940528e+00,  5.43360192e-01, -6.62623759e-01,  5.70598669e-01,
        2.01561507e+01,  2.27567319e+01, -1.62754244e+00,  3.95768443e+01,
        2.59722502e-01,  1.61686476e+01,  6.38592459e-01,  4.19087298e+00,
       -6.60797986e-02,  1.36998691e+02, -6.51836108e-01,  4.73986713e-02,
       -8.60413365e-01,  1.89725444e+01,  1.00629281e+00,  4.24739081e+01,
        8.35692112e-01,  7.93387525e+01,  5.29804178e-01,  4.41733903e+01,
        3.86973932e+01, -7.96895255e-01,  7.37053129e+01,  4.74841061e-01,
        5.10833966e+01, -6.03985187e-01,  1.50989125e+01,  1.21745387e+01,
        1.16778206e+00,  2.54420843e-01,  3.37602662e-01, -4.11876966e-01,
        1.26717683e+01,  5.73758509e+01,  3.94452142e-01,  8.87406702e+01,
        3.81287869e+00,  2.07540080e+00,  6.55968964e+00, -3.26023532e-01,
        1.20121392e+00,  1.62647187e+00,  3.63952241e+01, -1.00808631e+00,
        2.78024759e+01,  2.77601207e+01,  1.64778359e+01,  1.07462226e+01,
        1.89941344e+01,  6.66712498e+01,  6.56382984e+01,  2.77760398e+01,
        2.35614558e-01,  6.12219163e+01, -1.47858625e+00,  4.49509416e+01,
        3.38496407e-01,  4.54953373e+01,  7.97833334e+01,  2.31566775e+01,
        1.81866255e-01,  6.93919210e+01,  2.07867859e+01,  1.03899509e+02,
        8.30335817e-01, -8.56083826e-01,  5.79588325e+01,  2.90545928e+01,
        7.00914566e+00,  4.23748679e+01,  1.16877792e+02,  6.96157173e+01,
        2.49616759e+01,  4.54150876e+01,  2.61693212e+01,  3.77300493e-01,
        7.03746235e+00,  1.15177575e+01,  4.25185313e+00,  2.36037069e+01,
        4.13434903e-01,  5.60161717e+01,  5.32575596e+01,  3.21935944e+01,
       -1.58533687e-01,  8.55549632e+01,  5.74519657e+01, -5.55846709e-02,
        2.33751893e+01,  1.09625159e+01,  3.22877584e+01,  1.29221182e-01,
        1.09394795e-01,  7.25766624e-01,  1.90278093e+00,  2.23884024e-01,
       -7.90474455e-01,  8.33877891e+01,  1.88202450e+00,  3.07089917e+01,
        3.40192831e+00, -5.11215676e-01,  6.19587883e+01,  9.18573650e+00,
        5.57249123e-02,  2.58102886e+01,  2.44395609e+01,  1.03296056e+01,
       -1.58007899e-01,  5.26611604e+01,  4.75545775e+01, -1.65485667e+00,
        6.46791029e+01,  7.33179672e-02,  1.08647001e+01, -1.29507877e+00,
        2.83407180e+01,  1.66902153e+00, -2.59591351e-01,  4.15544195e+01,
       -2.45743064e-01,  5.02004600e+01,  2.07908923e+01,  1.00445084e+02,
       -2.30934530e-01,  8.28041754e+01,  6.84603638e+01,  4.93572046e+01,
        1.10794335e+01,  5.82696078e+00,  2.57335980e+00,  5.92184340e-02,
        1.39292919e-02,  5.70872688e+01,  1.98084761e-01,  1.80787078e+01,
        3.37324479e+01,  3.10348253e+01,  4.64805350e+00,  1.33254199e+01,
       -7.12845783e-01,  2.83681158e+01, -2.54977217e-01,  1.50399299e+00,
       -2.65096981e+00,  5.52687929e+01,  1.24608519e+00, -2.07339023e+00,
        2.15280473e+01,  4.16516766e+01, -1.32687315e+00, -7.77816688e-01,
       -1.11057585e+00,  3.94740357e+01,  9.35678393e-01,  4.98835954e+01,
        7.21672064e-01, -1.12905177e+00, -5.24520266e-01,  4.89374561e-01,
        7.33921370e+00,  4.18364538e+01, -2.40325398e-01,  3.29684501e+01,
        7.31587477e+01,  4.44263311e-01, -3.60966166e-01,  1.04274169e+02,
       -1.08106333e+00,  6.15935607e-01,  3.52020291e+01,  6.39846908e+01,
        3.26133022e-01, -1.25111358e+00,  5.74452939e+01,  3.65038363e+01,
       -5.22723021e-01,  1.98193555e+01,  2.14581715e+02, -1.40846130e+00,
       -1.55662917e+00,  6.01959038e+01, -1.28042935e+00,  1.75479418e+00,
        3.99319672e+01,  1.04210976e+02,  2.11017467e-01, -9.67131119e-02,
        4.15817490e+00,  3.99136114e-01, -3.76347024e-02,  1.10330188e+00,
        1.14227649e-01,  1.50301761e-01,  6.31935479e+01,  1.34684251e+01,
        5.87727582e+01,  5.93318059e+01, -1.34818542e+00,  7.43264094e-01,
        1.26272946e+01,  3.80599387e-01,  1.84339331e-02,  5.67971095e-01,
       -2.75828890e-01,  6.52733524e+01,  1.52913838e+01,  1.01548888e+02,
        4.08252756e-01,  9.20331091e+01,  1.02915564e+00,  1.01133914e+01,
        2.98162802e+01,  9.82690984e-01,  1.66547444e+00,  1.82581408e+01,
       -1.84087423e+00, -1.27957697e+00,  1.09184425e+00,  8.09158300e+01,
        5.17659020e-01,  1.60650632e+01,  1.86766764e-01, -7.55382932e-01,
        9.02787127e+00, -1.40666110e+00,  3.81134789e+01, -1.35168461e+00,
       -9.75873253e-01,  5.37903456e+01, -9.49398889e-01,  2.73826778e+01,
        4.93317901e-01,  4.27184385e+00, -8.58357780e-01,  7.00309879e-01,
       -5.75637826e-01,  3.91745640e+01,  2.56008454e+00, -9.60598997e-02,
        1.14927333e+00, -7.03176425e-01,  7.60889028e+01,  1.13391785e+01,
       -6.26967058e-01,  1.09636746e+02,  3.69662321e+01, -5.62466776e-01,
        3.35971777e+01,  4.23290339e+01,  6.21809962e-01,  1.34528632e+01,
        2.77490114e+01, -2.47518636e-01,  8.67392650e+00,  6.20672098e-01,
        2.05359534e+02,  9.19366707e+00,  3.80197851e-01,  1.04310509e+01,
        8.11139396e+00,  1.08078073e+00,  2.49390731e+01,  1.85169074e+01,
        3.35011023e+01, -6.14828580e-02,  8.43396125e+01,  1.70930427e+01,
        2.73798311e+01, -1.27674858e+00,  7.77179029e+01,  1.13890736e+02,
        1.28881450e+01,  2.75019871e+01,  3.54570631e+01,  4.56850498e+01,
        9.38283806e-01,  4.18617249e+01,  9.61207769e-02,  5.88341198e+01,
       -4.34496227e-01,  1.74726774e+01,  2.22133772e-01,  1.11095624e+02,
        6.38660478e+01,  8.02964325e+01,  3.25842082e+01,  1.91862309e+01,
        1.44697788e+00,  1.96554777e-01,  3.70896444e+01,  1.01181811e+01,
        2.67050266e-01,  8.44466714e+00,  3.02996058e+01,  1.06548038e+00,
       -5.17288450e-01,  4.19938171e+00,  2.29889812e+00,  9.64601871e+01,
        1.99688389e+01,  4.35057524e+01,  1.57957215e+00, -5.22860027e-01,
        2.31667894e+01, -2.81784609e-01,  4.91509552e+01, -9.18651946e-01,
        5.84600305e+01, -7.67797565e-01,  4.86931946e+01,  2.34214733e-01,
        5.17423260e+01,  1.79986887e+01,  5.41889089e+01,  3.62268791e+01,
        3.89907169e+01,  1.13376845e+02,  7.30223401e+00,  6.10600762e+01,
        5.68090106e+01,  3.00041252e+01,  2.89168644e-01,  2.45530014e+00,
       -6.37739984e-01,  3.63630036e+01, -6.23140526e-01,  4.60559502e+00,
        8.82810085e+01,  1.18901653e+00,  1.42050425e+00,  3.46602959e-01,
       -8.32355573e-01,  2.91068583e+01,  2.07953860e+01,  6.32931818e-01,
        5.67307179e+01,  2.60236926e+01,  5.91512652e+01,  9.37678393e+00,
        7.05373709e+01,  1.04075165e+02,  5.46855779e+01,  3.34456790e-01,
        7.50952173e+00,  9.81238915e+00, -1.76947227e-01, -7.98297245e-01,
       -1.37931923e+00,  9.05900531e+00,  2.36846302e+01,  1.79455786e+00,
       -5.17611299e-01,  2.23787952e-01,  2.27722121e+01,  3.38545236e+01,
        2.52693243e+00, -5.30868773e-01, -4.89439443e-01,  4.10845235e+01,
        1.38364420e+02,  3.33142153e+00,  5.83928185e-01,  1.37773186e+01,
        8.39384992e+01,  4.97237317e+01,  3.06129262e+01,  5.07274031e-01,
        1.01233066e+02,  6.74484791e+00,  8.16545686e+01,  3.57240104e+01,
        3.07790020e+00,  1.46713686e-01,  1.20650897e+00,  6.23745591e+00,
        3.68673309e-01, -3.93338812e-01,  9.98390819e+00,  1.27845186e+00,
        1.10046846e+01,  2.77530573e+01,  6.89558569e+01,  7.46253566e-01,
        1.31758718e+01,  3.27233434e+01, -3.07778235e-01,  1.29516575e+01,
        5.88581157e+01,  1.57745328e+00,  8.09891127e-01,  2.79021526e-01,
        6.07896510e-01,  1.16786284e+02,  2.16282543e+01,  6.61882507e+00,
        1.07363175e+00,  4.56163361e+01,  5.84334713e+01, -7.00120815e-01,
        6.50238090e+01,  4.04713899e+01, -5.58921847e-01,  3.77211875e-01,
        4.70644844e+01,  1.03148048e+02,  1.93713395e+00,  1.02059968e+01,
       -1.44801390e+00,  3.16304991e+00,  1.85016890e+01,  1.37667145e+01,
        1.14316375e+02,  4.61608040e+00,  1.07219611e+00, -5.64078631e-01,
        5.70811884e+01,  3.52930388e+01,  1.33897421e+02, -9.91758638e-02,
        8.32627671e+01, -1.07008477e+00, -1.52552517e+00, -6.91908070e-01,
        3.24545253e+01,  3.26034024e+01,  5.47388350e+00,  3.52055397e-01,
        6.97386061e+01,  6.90124040e+01, -8.21511784e-02,  7.03871380e+01,
        3.42725346e-01,  2.84477548e+01,  5.25396229e+00,  1.10229165e+02,
        2.78871522e+01,  5.16295307e+01,  2.35367064e+01,  5.12328476e+01,
        6.26622340e+01,  1.44011722e+00,  5.79370306e+01,  1.80094043e+00,
        3.94910677e+00,  2.34062385e+01,  5.81697112e+00, -6.81051657e-01,
        1.16159658e+01,  2.00766350e+01, -4.46183433e-01,  4.61046771e+01,
        1.00084701e+02, -2.42387933e+00,  3.54931376e+01,  7.60414656e-01,
        8.20522539e-01,  3.38284438e+01, -9.66976143e-01,  1.39267160e+01,
        1.29677682e+01, -1.15836469e+00,  4.08262399e+01,  5.98465277e+00,
        1.43087165e+01,  2.68858390e-02,  2.71793446e+01,  4.22579496e+01,
        1.22059505e+02, -6.81984248e-01,  9.39162928e+00,  6.40787645e+00,
        1.39675424e+01,  6.40842861e-01,  5.87374343e+01,  5.72582781e-01])
In [7]:
X.nnz / (X.shape[0] * X.shape[1])
Out[7]:
0.01
In [27]:
# Compute optimal parameter vector.
n, d = X.shape
A = sp.linalg.LinearOperator((d, d), matvec=lambda v: X.T @ (X @ v))
b = X.T @ y
w = sp.linalg.cg(A, b)[0]
In [28]:
w
Out[28]:
array([4.64339975e-02, 3.26604917e-01, 2.29059762e+00, 2.97799922e+00,
       4.43162378e+00, 4.36446042e+00, 6.21333225e+00, 6.55339954e+00,
       7.05355659e+00, 1.03196805e+01, 1.10572575e+01, 1.15858278e+01,
       1.17330927e+01, 1.33788421e+01, 1.37388481e+01, 1.54351381e+01,
       1.57893995e+01, 1.64644072e+01, 1.72860014e+01, 1.92960255e+01,
       2.07635357e+01, 2.11645421e+01, 2.22330136e+01, 2.42350792e+01,
       2.30280771e+01, 2.53374573e+01, 2.57806693e+01, 2.71485673e+01,
       2.78002859e+01, 2.84697530e+01, 3.10263660e+01, 3.07551333e+01,
       3.24995461e+01, 3.37672839e+01, 3.38745444e+01, 3.50316613e+01,
       3.63488936e+01, 3.82325952e+01, 3.76995083e+01, 3.81634254e+01,
       3.99686573e+01, 4.14135827e+01, 4.15741918e+01, 4.23832713e+01,
       4.33668312e+01, 4.48930318e+01, 4.77707684e+01, 4.46614983e+01,
       4.73798992e+01, 4.90847053e+01, 4.98237101e+01, 5.18169738e+01,
       5.11428677e+01, 5.30883577e+01, 5.32198253e+01, 5.48964684e+01,
       5.54724743e+01, 5.81113882e+01, 5.78044648e+01, 5.82857037e+01,
       6.10110395e+01, 6.17879142e+01, 6.14003995e+01, 6.27920372e+01,
       6.41256972e+01, 6.48996134e+01, 6.45764947e+01, 6.74807125e+01,
       6.83645753e+01, 6.91000515e+01, 7.06680225e+01, 7.14375467e+01,
       7.14460225e+01, 7.27029937e+01, 7.41242247e+01, 7.44094776e+01,
       7.61913445e+01, 7.68914070e+01, 7.76538667e+01, 7.96193506e+01,
       7.97402505e+01, 8.03906231e+01, 8.26255295e+01, 8.37349515e+01,
       8.36416654e+01, 8.50963847e+01, 8.59888530e+01, 8.69106689e+01,
       8.80390935e+01, 8.92648847e+01, 9.14780571e+01, 9.08513364e+01,
       9.08726437e+01, 9.32493484e+01, 9.45125753e+01, 9.46143945e+01,
       9.62206304e+01, 9.69323206e+01, 9.84048975e+01, 1.00099921e+02])
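The cell above computes the least-squares optimum iteratively: minimizing the squared error leads to the normal equations, which the conjugate gradient solver handles without ever forming the $d \times d$ matrix $X^\top X$ explicitly (the LinearOperator only evaluates the product $X^\top (X v)$):

$$\min_w \|Xw - y\|^2 \quad \Longleftrightarrow \quad X^\top X w = X^\top y$$

Note that the recovered parameter vector is close to the true v = np.arange(d) used to generate the data.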

The Boston Housing Problem¶

The Boston Housing dataset was originally published by Harrison and Rubinfeld in 1978, in a paper titled Hedonic prices and the demand for clean air. The authors used the dataset to investigate the relationship between air quality and housing prices in the Boston metropolitan area. Since then, the dataset has been widely used in machine learning and statistics research as a benchmark for regression tasks.

Each row in the data table contains various attributes of a district in Boston. The last attribute is the median value of owner-occupied homes in the district. We will create regressors that estimate this attribute from the other attributes of the district.

Exercise 1: Load the Boston Housing data set to DataFrame and display basic statistics about it!

In [1]:
# Extract column names.
fname = '../_data/housing_names.txt'
columns = []
for line in open(fname):
    if line.startswith('    ') and line[4].isdigit():
        columns.append(line.split()[1])
In [2]:
columns
Out[2]:
['CRIM',
 'ZN',
 'INDUS',
 'CHAS',
 'NOX',
 'RM',
 'AGE',
 'DIS',
 'RAD',
 'TAX',
 'PTRATIO',
 'LSTAT',
 'MEDV']
In [3]:
# Load to DataFrame.
import pandas as pd
df = pd.read_csv('../_data/housing_data.txt', sep='\t', names=columns)
df
Out[3]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO LSTAT MEDV
0 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296.0 15.3 4.98 24.0
1 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242.0 17.8 9.14 21.6
2 0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242.0 17.8 4.03 34.7
3 0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222.0 18.7 2.94 33.4
4 0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222.0 18.7 5.33 36.2
... ... ... ... ... ... ... ... ... ... ... ... ... ...
501 0.06263 0.0 11.93 0 0.573 6.593 69.1 2.4786 1 273.0 21.0 9.67 22.4
502 0.04527 0.0 11.93 0 0.573 6.120 76.7 2.2875 1 273.0 21.0 9.08 20.6
503 0.06076 0.0 11.93 0 0.573 6.976 91.0 2.1675 1 273.0 21.0 5.64 23.9
504 0.10959 0.0 11.93 0 0.573 6.794 89.3 2.3889 1 273.0 21.0 6.48 22.0
505 0.04741 0.0 11.93 0 0.573 6.030 80.8 2.5050 1 273.0 21.0 7.88 11.9

506 rows × 13 columns

In [4]:
# Check column data types.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 13 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   CRIM     506 non-null    float64
 1   ZN       506 non-null    float64
 2   INDUS    506 non-null    float64
 3   CHAS     506 non-null    int64  
 4   NOX      506 non-null    float64
 5   RM       506 non-null    float64
 6   AGE      506 non-null    float64
 7   DIS      506 non-null    float64
 8   RAD      506 non-null    int64  
 9   TAX      506 non-null    float64
 10  PTRATIO  506 non-null    float64
 11  LSTAT    506 non-null    float64
 12  MEDV     506 non-null    float64
dtypes: float64(11), int64(2)
memory usage: 51.5 KB
In [5]:
# Basic column statistics.
df.describe().T
Out[5]:
count mean std min 25% 50% 75% max
CRIM 506.0 3.613524 8.601545 0.00632 0.082045 0.25651 3.677083 88.9762
ZN 506.0 11.363636 23.322453 0.00000 0.000000 0.00000 12.500000 100.0000
INDUS 506.0 11.136779 6.860353 0.46000 5.190000 9.69000 18.100000 27.7400
CHAS 506.0 0.069170 0.253994 0.00000 0.000000 0.00000 0.000000 1.0000
NOX 506.0 0.554695 0.115878 0.38500 0.449000 0.53800 0.624000 0.8710
RM 506.0 6.284634 0.702617 3.56100 5.885500 6.20850 6.623500 8.7800
AGE 506.0 68.574901 28.148861 2.90000 45.025000 77.50000 94.075000 100.0000
DIS 506.0 3.795043 2.105710 1.12960 2.100175 3.20745 5.188425 12.1265
RAD 506.0 9.549407 8.707259 1.00000 4.000000 5.00000 24.000000 24.0000
TAX 506.0 408.237154 168.537116 187.00000 279.000000 330.00000 666.000000 711.0000
PTRATIO 506.0 18.455534 2.164946 12.60000 17.400000 19.05000 20.200000 22.0000
LSTAT 506.0 12.653063 7.141062 1.73000 6.950000 11.36000 16.955000 37.9700
MEDV 506.0 22.532806 9.197104 5.00000 17.025000 21.20000 25.000000 50.0000

Exercise 2: Build a univariate linear model for each column and measure its RMSE (root mean squared error)! Use the full data set both for model building and error measurement!

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^n (\hat{y}_i - y_i)^2}$$
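As a sketch, the formula above translates directly into a small NumPy helper (the cells below compute the same quantity inline):

```python
import numpy as np

def rmse(y, yhat):
    """Root mean squared error between the target vector y and the prediction yhat."""
    y, yhat = np.asarray(y), np.asarray(yhat)
    return float(np.sqrt(np.mean((yhat - y) ** 2)))

# Example: predictions that are off by exactly 3 everywhere have an RMSE of 3.
print(rmse([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # 3.0
```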
In [6]:
y = df['MEDV'].values
y = y - y.mean()

for column in columns[:-1]:
    x = df[column].values
    x = x - x.mean()
    
    w = (x @ y) / (x @ x) # optimal model parameter
    yhat = x * w          # prediction
    rmse = ((yhat - y)**2).mean()**0.5 # root mean squared error
    
    print(column, rmse)
CRIM 8.467038200100824
ZN 8.570396495772854
INDUS 8.04153105080589
CHAS 9.045800910882107
NOX 8.306881987569504
RM 6.603071389222562
AGE 8.510228018625197
DIS 8.896422965780745
RAD 8.492632800301259
TAX 8.117097716353989
PTRATIO 7.915314271320455
LSTAT 6.20346413142642

Exercise 3: Build a multivariate linear model and measure its RMSE! Use the full data set both for model building and error measurement!

In [7]:
import numpy as np

X = df[columns[:-1]].values
y = df['MEDV'].values
w = np.linalg.solve(X.T @ X, X.T @ y) # optimal parameter vector
yhat = X @ w                          # prediction
rmse = ((yhat - y)**2).mean()**0.5    # root mean squared error
rmse
Out[7]:
5.065952606279903

Exercise 4: Introduce a bias term into the multivariate linear model!

In [8]:
X2 = np.hstack([X, np.ones((X.shape[0], 1))])
y = df['MEDV'].values
w = np.linalg.solve(X2.T @ X2, X2.T @ y) # optimal parameter vector
yhat = X2 @ w                            # prediction
rmse = ((yhat - y)**2).mean()**0.5       # root mean squared error
rmse
Out[8]:
4.735998462783738
In [9]:
# Display the weights associated with the features.
pd.Series(w[:-1], columns[:-1])
Out[9]:
CRIM       -0.121389
ZN          0.046963
INDUS       0.013468
CHAS        2.839993
NOX       -18.758022
RM          3.658119
AGE         0.003611
DIS        -1.490754
RAD         0.289405
TAX        -0.012682
PTRATIO    -0.937533
LSTAT      -0.552019
dtype: float64
In [10]:
# Display the weights associated with the features, after scaling the columns.
X3 = np.hstack([X / X.std(axis=0), np.ones((X.shape[0], 1))])
y = df['MEDV'].values
w = np.linalg.solve(X3.T @ X3, X3.T @ y) # optimal parameter vector
pd.Series(w[:-1], columns[:-1])
Out[10]:
CRIM      -1.043097
ZN         1.094220
INDUS      0.092302
CHAS       0.720628
NOX       -2.171487
RM         2.567716
AGE        0.101537
DIS       -3.135992
RAD        2.517429
TAX       -2.135271
PTRATIO   -2.027701
LSTAT     -3.938105
dtype: float64
Using the same dataset for both model training and evaluation is a bad idea in machine learning because it can lead to overfitting. When a model is trained and evaluated on the same dataset, it can achieve high accuracy on that data, but it may not generalize well to new, unseen data. This is because the model has simply memorized the training data instead of learning the underlying patterns that can apply to new data.

Exercise 5: Repeat the previous experiment so that the model is built on a training set and evaluated on a distinct test set!

In [11]:
n = df.shape[0]
rs = np.random.RandomState(42)
idxs = rs.permutation(n)
s = int(n * 0.7)
tr = idxs[:s] # indices of the training set
te = idxs[s:] # indices of the test set
In [12]:
len(tr), len(te)
Out[12]:
(354, 152)
In [13]:
X2 = np.hstack([X, np.ones((X.shape[0], 1))])
y = df['MEDV'].values
w = np.linalg.solve(X2[tr].T @ X2[tr], X2[tr].T @ y[tr]) # optimal parameter vector: use training set only!
yhat = X2 @ w                                            # prediction
rmse_te = ((yhat[te] - y[te])**2).mean()**0.5            # root mean squared error: use test set only!
rmse_te
Out[13]:
4.82601778299211

scikit-learn¶

Scikit-learn is a popular open-source machine learning library in Python. It provides a range of supervised and unsupervised learning algorithms for tasks such as classification, regression, clustering, and dimensionality reduction. Scikit-learn also includes tools for model selection, preprocessing, and evaluation, making it a comprehensive library for building and evaluating machine learning models.

  • scikit-learn is based on NumPy, SciPy and matplotlib.
  • The import name of the package is sklearn.
  • Regressors and classifiers in scikit-learn always have a fit() and a predict() method.
  • The fit() method requires 2 parameters: the input matrix $X$ and a target vector $y$. Calling the fit() method trains a model on the given data.
  • The predict() method requires an input matrix $X$ and returns the prediction of the trained model for the given inputs.
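The fit/predict workflow can be sketched on a tiny hypothetical toy data set (the target is exactly 2 times the input, so the model recovers the slope):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: 4 samples, 1 feature; the target is exactly 2 * x.
X = np.array([[1.0], [2.0], [3.0], [4.0]])  # input matrix
y = np.array([2.0, 4.0, 6.0, 8.0])          # target vector

re = LinearRegression()
re.fit(X, y)          # train the model on (X, y)
yhat = re.predict(X)  # predictions for the inputs in X

print(re.coef_, re.intercept_)  # slope close to 2, intercept close to 0
```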

Exercise 6: Repeat the previous experiments using scikit-learn!

In [14]:
# Query the version number of scikit-learn.
import sklearn
sklearn.__version__
Out[14]:
'1.2.1'
In [15]:
# This is how we could create a train-test split with scikit-learn.
# However, we will keep using the original split to make the results comparable.

from sklearn.model_selection import ShuffleSplit
tr2, te2 = next(ShuffleSplit(random_state=42, test_size=0.3).split(X))
In [16]:
# In scikit-learn, the closest thing to RMSE is mean_squared_error;
# taking its square root gives the RMSE.
from sklearn.metrics import mean_squared_error
In [17]:
# Univariate models, no train-test split.

from sklearn.linear_model import LinearRegression

y = df['MEDV'].values

for column in columns[:-1]:
    x = df[[column]].values
    re = LinearRegression()
    re.fit(x, y)
    yhat = re.predict(x)
    rmse = mean_squared_error(y, yhat)**0.5
    print(column, rmse)
CRIM 8.467038200100824
ZN 8.570396495772854
INDUS 8.04153105080589
CHAS 9.045800910882107
NOX 8.306881987569504
RM 6.603071389222561
AGE 8.510228018625199
DIS 8.896422965780747
RAD 8.492632800301259
TAX 8.117097716353987
PTRATIO 7.915314271320455
LSTAT 6.20346413142642
In [18]:
# Multivariate model without bias, no train-test split.
X = df[columns[:-1]].values
y = df['MEDV'].values
re = LinearRegression(fit_intercept=False)
re.fit(X, y)
yhat = re.predict(X)
rmse = mean_squared_error(y, yhat)**0.5
print(rmse)
5.065952606279903
In [19]:
# Multivariate model with bias, with train-test split.
X = df[columns[:-1]].values
y = df['MEDV'].values
re = LinearRegression()
re.fit(X[tr], y[tr]) # use training set
yhat = re.predict(X)
rmse = mean_squared_error(y[te], yhat[te])**0.5 # use test set
print(rmse)
4.826017782992097

Logistic Regression¶

  • Logistic regression is an algorithm for classification.

  • Logistic regression builds a linear model that separates the two classes by a hyperplane.

For simplicity, let's first discuss the univariate case, with a binary target variable and no bias term.
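In this setting, the model maps the input to a probability of the positive class through the sigmoid function:

$$P(y = 1 \mid x) = \sigma(w x), \qquad \sigma(t) = \frac{1}{1 + e^{-t}}$$

The single parameter $w$ is found by minimizing the cross-entropy loss on the training data.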

The Programming Exam Problem¶

In [1]:
import pandas as pd
data = [
    {'name': 'David Beckham',    'study_time': 0,   'result': 0},
    {'name': 'Jessica Scott',    'study_time': 7,   'result': 1},
    {'name': 'Jack Johnson',     'study_time': 3.5, 'result': 0},
    {'name': 'Scunner Campbell', 'study_time': 6,   'result': 0},
    {'name': 'Plain Jane ',      'study_time': 3,   'result': 1},
    {'name': 'Archie Gillis',    'study_time': 15,  'result': 1},
]
df = pd.DataFrame(data)
df
Out[1]:
name study_time result
0 David Beckham 0.0 0
1 Jessica Scott 7.0 1
2 Jack Johnson 3.5 0
3 Scunner Campbell 6.0 0
4 Plain Jane 3.0 1
5 Archie Gillis 15.0 1

The above toy data set contains 2 attributes of 6 students:

  • Hours spent on preparing for the exam.
  • Did the student pass the exam? (0=no, 1=yes)

Exercise 1: Train a univariate logistic regression model that estimates the result column from the study_time column!

In [2]:
x = df['study_time'].values # input vector
y = df['result'].values     # target vector

# subtract mean from input
xm = x.mean()
x -= xm

print(x, y)
[-5.75  1.25 -2.25  0.25 -2.75  9.25] [0 1 0 0 1 1]
In [3]:
import numpy as np
def sigmoid(t):
    return 1 / (1 + np.exp(-t))
In [4]:
w = 0 # initial model parameter

for it in range(10):
    yhat = sigmoid(x * w)
    ce = -(np.log(yhat) * y + np.log(1 - yhat) * (1 - y)).sum()
    ce_i = ((yhat - y) * x).sum()
    ce_ii = (yhat * (1 - yhat) * x**2).sum()
    print(f'w={w:.9f} CE(w)={ce:.9f}')    
    w -= ce_i / ce_ii # Newton step
w=0.000000000 CE(w)=4.158883083
w=0.233301976 CE(w)=3.151547142
w=0.328737831 CE(w)=3.065299756
w=0.358540912 CE(w)=3.060405283
w=0.360841429 CE(w)=3.060380944
w=0.360853783 CE(w)=3.060380943
w=0.360853783 CE(w)=3.060380943
w=0.360853783 CE(w)=3.060380943
w=0.360853783 CE(w)=3.060380943
w=0.360853783 CE(w)=3.060380943
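The loop above implements Newton's method on the cross-entropy loss; the quantities ce, ce_i and ce_ii in the code correspond to

$$\mathrm{CE}(w) = -\sum_i \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right], \qquad \hat{y}_i = \sigma(x_i w),$$

$$\mathrm{CE}'(w) = \sum_i (\hat{y}_i - y_i)\, x_i, \qquad \mathrm{CE}''(w) = \sum_i \hat{y}_i (1 - \hat{y}_i)\, x_i^2,$$

and the update is $w \leftarrow w - \mathrm{CE}'(w) / \mathrm{CE}''(w)$, which converges in a handful of iterations, as the printout shows.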
In [5]:
# Display the probability of passing the exam (according to the model)
# as a function of the study hours!
import matplotlib.pyplot as plt
x2 = np.arange(0, 24, 0.5)
yhat2 = sigmoid((x2 - xm) * w)
plt.figure(figsize=(16, 6))
plt.plot(x2, yhat2)
plt.grid(True)
plt.xlabel('study hours')
plt.ylabel('P(passing the exam)')
Out[5]:
Text(0, 0.5, 'P(passing the exam)')

The Wisconsin Breast Cancer Problem¶

The Wisconsin Breast Cancer data set contains the attributes of 699 suspicious lesions in tissue microscopy images. The raw data is contained in wisconsin_data.txt, and the description can be read in wisconsin_names.txt. The task is to estimate whether the lesion is malignant (4) or benign (2), based on the image attributes of the lesion. Therefore, the task is a binary classification problem.

Exercise 2: Train a univariate logistic regression model for each input feature separately, and measure the average cross-entropy of the models! Use the full data set both for training and evaluation!

In [6]:
# Column names.
names = [
    'Sample_code_number',
    'Clump_Thickness',
    'Uniformity_of_Cell_Size',
    'Uniformity_of_Cell_Shape',
    'Marginal_Adhesion',
    'Single_Epithelial_Cell_Size',
    'Bare_Nuclei',
    'Bland_Chromatin',
    'Normal_Nucleoli',
    'Mitoses',
    'Class'
]
In [7]:
df = pd.read_csv('../_data/wisconsin_data.txt', sep=',', names=names, na_values='?')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 699 entries, 0 to 698
Data columns (total 11 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Sample_code_number           699 non-null    int64  
 1   Clump_Thickness              699 non-null    int64  
 2   Uniformity_of_Cell_Size      699 non-null    int64  
 3   Uniformity_of_Cell_Shape     699 non-null    int64  
 4   Marginal_Adhesion            699 non-null    int64  
 5   Single_Epithelial_Cell_Size  699 non-null    int64  
 6   Bare_Nuclei                  683 non-null    float64
 7   Bland_Chromatin              699 non-null    int64  
 8   Normal_Nucleoli              699 non-null    int64  
 9   Mitoses                      699 non-null    int64  
 10  Class                        699 non-null    int64  
dtypes: float64(1), int64(10)
memory usage: 60.2 KB
In [8]:
# substitute mean for the missing Bare_Nuclei values
df['Bare_Nuclei'].fillna(df['Bare_Nuclei'].mean(), inplace=True)
In [9]:
# target vector
y = df['Class'].values // 2 - 1
In [10]:
def logreg_predict(x, w):
    return sigmoid(x * w)

def logreg_fit(x, y, niter=10):
    w = 0 # initial model parameter
    for it in range(niter):
        yhat = logreg_predict(x, w)
        ce_i = ((yhat - y) * x).sum()
        ce_ii = (yhat * (1 - yhat) * x**2).sum()
        w -= ce_i / ce_ii # Newton step    
    return w

def avg_cross_entropy(y, yhat):
    ce = -(np.log(yhat) * y + np.log(1 - yhat) * (1 - y)).sum()
    return ce / len(y)
    
for column in names[1:-1]:
    se = df[column]
    x = (se - se.mean()).values # input vector
    w = logreg_fit(x, y)
    yhat = logreg_predict(x, w)
    ace = avg_cross_entropy(y, yhat)
    print(f'{column:30} AvgCE={ace:.9f} w={w:.9f}')
Clump_Thickness                AvgCE=0.390611001 w=0.905912818
Uniformity_of_Cell_Size        AvgCE=0.199170002 w=1.572130112
Uniformity_of_Cell_Shape       AvgCE=0.211668062 w=1.524221657
Marginal_Adhesion              AvgCE=0.359316116 w=1.107418964
Single_Epithelial_Cell_Size    AvgCE=0.351132841 w=1.537879946
Bare_Nuclei                    AvgCE=0.265288430 w=0.979151501
Bland_Chromatin                AvgCE=0.303804235 w=1.542319399
Normal_Nucleoli                AvgCE=0.355869859 w=1.004452002
Mitoses                        AvgCE=0.527654735 w=1.711238337

Exercise 4: Repeat the previous experiment using scikit-learn!

In [11]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss # average cross entropy

for column in names[1:-1]:
    se = df[column]
    x = (se - se.mean()).values # input vector
    
    cl = LogisticRegression(fit_intercept=False, C=1e12) # define model
    X = x.reshape((-1, 1))
    cl.fit(X, y)                                         # train model
    w = cl.coef_[0, 0]                                   # extract model parameter
    yhat = cl.predict_proba(X)[:, 1]                     # make probability prediction
    ace = log_loss(y, yhat)                              # compute average cross entropy
    
    print(f'{column:30} AvgCE={ace:.9f} w={w:.9f}')
Clump_Thickness                AvgCE=0.390611001 w=0.905912814
Uniformity_of_Cell_Size        AvgCE=0.199170002 w=1.572130109
Uniformity_of_Cell_Shape       AvgCE=0.211668062 w=1.524221656
Marginal_Adhesion              AvgCE=0.359316116 w=1.107418964
Single_Epithelial_Cell_Size    AvgCE=0.351132841 w=1.537879938
Bare_Nuclei                    AvgCE=0.265288430 w=0.979151504
Bland_Chromatin                AvgCE=0.303804235 w=1.542318627
Normal_Nucleoli                AvgCE=0.355869859 w=1.004451975
Mitoses                        AvgCE=0.527654735 w=1.711237945

Exercise 5: Introduce a 70-30% train-test split and re-run the experiment!

In [12]:
# train-test split
from sklearn.model_selection import ShuffleSplit
tr, te = next(ShuffleSplit(test_size=0.3, random_state=42).split(df))

for column in names[1:-1]:
    se = df[column]
    x = (se - se.mean()).values # input vector
    
    cl = LogisticRegression(fit_intercept=False, C=1e12) # define model
    X = x.reshape((-1, 1))
    cl.fit(X[tr], y[tr])                                 # train model (on training set)
    w = cl.coef_[0, 0]                                   # extract model parameter
    yhat = cl.predict_proba(X)[:, 1]                     # make probability prediction
    ace = log_loss(y[te], yhat[te])                      # compute average cross entropy (on test set)
    
    print(f'{column:30} AvgCE={ace:.9f} w={w:.9f}')
Clump_Thickness                AvgCE=0.335747286 w=0.826808279
Uniformity_of_Cell_Size        AvgCE=0.176748276 w=1.515755629
Uniformity_of_Cell_Shape       AvgCE=0.217456999 w=1.535220556
Marginal_Adhesion              AvgCE=0.300066855 w=1.024555296
Single_Epithelial_Cell_Size    AvgCE=0.350178198 w=1.527215223
Bare_Nuclei                    AvgCE=0.311173420 w=1.008497031
Bland_Chromatin                AvgCE=0.306641746 w=1.453926694
Normal_Nucleoli                AvgCE=0.362623274 w=1.024082073
Mitoses                        AvgCE=0.491826787 w=1.562289264

Multivariate Logistic Regression¶

The previous approach can be generalized to allow multiple input features.

  • model's prediction: $\hat{y} = \sigma(Xw)$
  • objective function: $CE(w) = -\log(\hat{y})^Ty - \log(1 - \hat{y})^T(1 - y)$
  • gradient vector: $\frac{d}{dw} CE(w) = X^T(\hat{y} - y)$
  • Hessian matrix: $\left(\frac{d}{dw}\right)^2 CE(w) = X^T \mathrm{diag}\left(\hat{y}(1 - \hat{y})\right) X$
  • Newton-step: $w_{\mathrm{new}} = w - \left[\left(\frac{d}{dw}\right)^2 CE(w)\right]^{-1} \left[\frac{d}{dw} CE(w)\right]$

Similarly to linear regression, the bias term can be handled by introducing a constant 1 feature.
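For example, augmenting the input matrix takes one line of NumPy (a minimal sketch on a hypothetical toy matrix):

```python
import numpy as np

# Toy input matrix (hypothetical values): 2 samples, 2 features.
X = np.array([[2.0, 1.0],
              [3.0, 5.0]])

# Prepend a constant 1 column; the first component of w then acts as the bias.
Xb = np.hstack([np.ones((X.shape[0], 1)), X])
print(Xb.shape)  # one extra column: (2, 3)
```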

Exercise 6: Train a multivariate logistic regression model on the training set and measure its cross-entropy on the test set! Implement the training algorithm without using scikit-learn!

In [13]:
X = df[names[1:-1]].values      # input matrix
y = df['Class'].values // 2 - 1 # target vector
In [26]:
def cross_entropy(y, yhat):
    return -np.log(yhat) @ y - np.log(1 - yhat) @ (1 - y)

def logreg_m_fit(X, y, niter=10):
    w = np.zeros(X.shape[1]) # initial model
    for it in range(niter):
        yhat = sigmoid(X @ w)             # prediction
        ce = cross_entropy(y, yhat)       # cross-entropy
        g = X.T @ (yhat - y)              # gradient of cross-entropy
        H = X.T * (yhat * (1 - yhat)) @ X # Hessian matrix of cross-entropy
        print(it, ce)
        w -= np.linalg.solve(H, g)        # Newton-step
        
    return w

w = logreg_m_fit(X[tr], y[tr])
yhat = sigmoid(X @ w)
cross_entropy(y[te], yhat[te])
0 338.9489712938133
1 219.31492649580565
2 194.84385525659366
3 189.2334455179607
4 188.79436565306744
5 188.79058752143467
6 188.7905871820355
7 188.79058718203555
8 188.79058718203555
9 188.79058718203552
Out[26]:
92.3437344981877
In [27]:
w
Out[27]:
array([-0.37775352,  0.81513883,  0.24794682, -0.00703306, -0.7267206 ,
        0.585522  , -0.43993472,  0.32078896, -0.23507939])

Exercise 7: Use scikit-learn for training the model and compare the results against the previous experiment's results!

In [44]:
from sklearn.linear_model import LogisticRegression

cl = LogisticRegression(fit_intercept=False, C=1e12) # no bias, no regularization
cl.fit(X[tr], y[tr])
yhat = cl.predict_proba(X)[:, 1]
ce = cross_entropy(y[te], yhat[te])
print(ce, ce / len(te))
28.330792605335226 0.4105911971787714
In [30]:
cl.coef_[0]
Out[30]:
array([-0.37775325,  0.81513397,  0.24796104, -0.00703901, -0.72671579,
        0.58551021, -0.43993414,  0.32079127, -0.23508148])

$K$-Fold Cross-Validation¶

  • Idea: Randomly split the data into $K$ roughly equal partitions and run $K$ experiments!
  • In the $i$-th experiment, partition $i$ is used as the test set and all other partitions as the training set.
  • In the end, the scores obtained from the $K$ experiments are averaged.
  • $K$-fold cross-validation is slower but more reliable than the simple train-test split.
  • In the stratified variant of the method, the same distribution of labels is enforced in every partition.
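Scikit-learn implements the stratified variant as StratifiedKFold; a minimal sketch on hypothetical, imbalanced toy data:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.arange(20).reshape(10, 2)  # toy input matrix
y = np.array([0] * 8 + [1] * 2)   # imbalanced labels: only 2 positives

cv = StratifiedKFold(2, shuffle=True, random_state=42)
for tr, te in cv.split(X, y):     # split() needs y to stratify on
    # each test fold receives exactly one positive example
    print(y[te].sum())
```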

Exercise 8: Replace the train-test split with 10-fold cross-validation and re-run the last experiment!

In [40]:
from sklearn.model_selection import KFold
from sklearn.metrics import log_loss

cv = KFold(10, shuffle=True, random_state=42)
scores = []
for tr, te in cv.split(X):
    cl = LogisticRegression(fit_intercept=False, C=1e12) # no bias, no regularization
    cl.fit(X[tr], y[tr])
    yhat = cl.predict_proba(X)[:, 1]
    score = log_loss(y[te], yhat[te])
    scores.append(score)
In [41]:
scores
Out[41]:
[0.3103854498633047,
 0.5013502260500776,
 0.492728969735146,
 0.34584544309166304,
 0.5820155518006203,
 0.3244320020003381,
 0.4848963100207666,
 0.3690964445150554,
 0.38712987311278146,
 0.41059119717877135]
In [42]:
np.mean(scores)
Out[42]:
0.4208471467368525

Overfitting & Regularization¶

Regularization techniques in machine learning are used to prevent or reduce overfitting of a model to the training data. The fundamental idea is to simplify the model, making it less accurate on the training set but more accurate on unseen data. Essentially, the goal of regularization is to strike a good balance between model complexity and generalization performance.

Some specific regularization techniques are L1 regularization, L2 regularization, noise injection, data augmentation, early stopping, dropout and ensembling. In this lecture we will discuss L2 regularization, a simple and well-understood form of regularization.

The SMS Spam Problem¶

The file smsspam.txt contains labeled SMS messages. Label "ham" indicates normal messages, label "spam" indicates undesired messages. The goal is to build a classifier that estimates the label based on the text content of the SMS. The error metric should be the average cross-entropy (aka log_loss).

In [1]:
# Load the raw data into a DataFrame!
import pandas as pd
df = pd.read_csv('../_data/smsspam.txt', sep='\t', names=['label', 'message'])
df
Out[1]:
label message
0 ham Go until jurong point, crazy.. Available only ...
1 ham Ok lar... Joking wif u oni...
2 spam Free entry in 2 a wkly comp to win FA Cup fina...
3 ham U dun say so early hor... U c already then say...
4 ham Nah I don't think he goes to usf, he lives aro...
... ... ...
5567 spam This is the 2nd time we have tried 2 contact u...
5568 ham Will ü b going to esplanade fr home?
5569 ham Pity, * was in mood for that. So...any other s...
5570 ham The guy did some bitching but I acted like i'd...
5571 ham Rofl. Its true to its name

5572 rows × 2 columns

In [2]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   label    5572 non-null   object
 1   message  5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB
In [3]:
# Distribution of labels.
df.groupby('label').size()
Out[3]:
label
ham     4825
spam     747
dtype: int64

How to represent SMS messages as numbers?¶

  • Machine learning algorithms work with numbers, not words.
  • One way to transform text into numbers is the bag-of-words approach.
    • We assign an index to each word, based on a dictionary.
    • Each document is represented by a vector $x = [x_1, \dots, x_d]$ where $x_j$ denotes how many times the $j$-th word appears in the document.

Exercise 1: Compute the bag-of-words representation of the following example documents!

In [4]:
# Example documents.
documents = [
    'John likes to watch movies. Mary likes movies too.',
    'Mary also likes to watch football games.'
]
In [5]:
# Set of all words.

import string

def tokenize(doc):
    return [w.strip(string.punctuation) for w in doc.lower().split()]

words = set()
for doc in documents:
    for w in tokenize(doc):
        words.add(w)
words
Out[5]:
{'also',
 'football',
 'games',
 'john',
 'likes',
 'mary',
 'movies',
 'to',
 'too',
 'watch'}
In [6]:
# Create dictionary of word indices.
word_to_idx = {w: i for i, w in enumerate(sorted(words))}
word_to_idx
Out[6]:
{'also': 0,
 'football': 1,
 'games': 2,
 'john': 3,
 'likes': 4,
 'mary': 5,
 'movies': 6,
 'to': 7,
 'too': 8,
 'watch': 9}
In [7]:
# Build input matrix (binary variant: 1 if the word occurs in the document).
import numpy as np
X = np.zeros((len(documents), len(words)), dtype='int64')

for i, doc in enumerate(documents):
    for w in tokenize(doc):
        j = word_to_idx[w]
        X[i, j] = 1
X
Out[7]:
array([[0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 0, 1, 1, 0, 1, 0, 1]])
In [8]:
# Shorter solution with sklearn's CountVectorizer.
from sklearn.feature_extraction.text import CountVectorizer

X_sp = CountVectorizer(binary=True).fit_transform(documents) # fit(), then transform()
X_sp
Out[8]:
<2x10 sparse matrix of type '<class 'numpy.int64'>'
	with 14 stored elements in Compressed Sparse Row format>
In [9]:
X_sp.toarray()
Out[9]:
array([[0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 0, 1, 1, 0, 1, 0, 1]])

Sparse Matrices in Python¶

  • A sparse matrix contains many zeros and only a low ratio of nonzero elements.
  • An efficient way to store a sparse matrix is to store only the nonzero elements.
  • Linear algebra operations (like matrix-times-vector) use different algorithms than in the dense case.
  • NumPy supports dense matrices only.
  • A library to work with sparse matrices is scipy.sparse. A more general library that supports multidimensional sparse arrays is Sparse.

Some standard storage formats:

  • coo_matrix: A sparse matrix in COOrdinate list format.
  • csr_matrix: Compressed Sparse Row (CSR) matrix
  • csc_matrix: Compressed Sparse Column (CSC) matrix.

Typically, the COO format is the most convenient for creating/building matrices. However, in computations, the CSR or CSC format can be more efficient.

In [10]:
# Create an example COO matrix!
import scipy.sparse as sp
data = [1, 1, 1, 1, 1]
i    = [0, 0, 1, 2, 2] # row indices
j    = [0, 3, 1, 4, 5] # column indices

A = sp.coo_matrix((data, (i, j)))
A
Out[10]:
<3x6 sparse matrix of type '<class 'numpy.int64'>'
	with 5 stored elements in COOrdinate format>
In [11]:
# Number of nonzeros.
A.nnz
Out[11]:
5
In [12]:
# Internal representation (.row, .col, .data).
A.row
Out[12]:
array([0, 0, 1, 2, 2], dtype=int32)
In [13]:
A.col
Out[13]:
array([0, 3, 1, 4, 5], dtype=int32)
In [14]:
A.data
Out[14]:
array([1, 1, 1, 1, 1])
In [15]:
# Conversion to NumPy array.
A.toarray()
Out[15]:
array([[1, 0, 0, 1, 0, 0],
       [0, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 1]])
In [16]:
# Conversion to CSR format.
B = A.tocsr()
B
Out[16]:
<3x6 sparse matrix of type '<class 'numpy.int64'>'
	with 5 stored elements in Compressed Sparse Row format>
In [17]:
# Internal representation (.indices, .indptr, .data)
B.indices
Out[17]:
array([0, 3, 1, 4, 5], dtype=int32)
In [18]:
B.indptr
Out[18]:
array([0, 2, 3, 5], dtype=int32)
In [19]:
B.data
Out[19]:
array([1, 1, 1, 1, 1])
In [20]:
# Selecting rows.
B[0]
Out[20]:
<1x6 sparse matrix of type '<class 'numpy.int64'>'
	with 2 stored elements in Compressed Sparse Row format>
In [21]:
# Try to select rows from a COO matrix.
A[0]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [21], in <cell line: 2>()
      1 # Try to select rows from a COO matrix.
----> 2 A[0]

TypeError: 'coo_matrix' object is not subscriptable
In [22]:
# Conversion to CSC format.
C = A.tocsc()
In [23]:
C[:, 0]
Out[23]:
<3x1 sparse matrix of type '<class 'numpy.int64'>'
	with 1 stored elements in Compressed Sparse Column format>
In [24]:
# What happens if we transpose a CSR matrix?
B.T
Out[24]:
<6x3 sparse matrix of type '<class 'numpy.int64'>'
	with 5 stored elements in Compressed Sparse Column format>

Back to the SMS Spam Problem¶

Exercise 2: Prepare the input matrix $X$ (sparse) and the target vector $y$ (dense)!

In [25]:
X = CountVectorizer(binary=True).fit_transform(df['message'])
X
Out[25]:
<5572x8713 sparse matrix of type '<class 'numpy.int64'>'
	with 74169 stored elements in Compressed Sparse Row format>
In [26]:
y = (df['label'] == 'spam').values.astype('int64')
y
Out[26]:
array([0, 0, 1, ..., 0, 0, 0])

Exercise 3: Measure the training and test log_loss of logistic regression, using 5-fold cross-validation! (Scikit-learn's LogisticRegression and LinearRegression accept sparse matrices as the input matrix.)

In [27]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import KFold

cv = KFold(5, shuffle=True, random_state=42)
scores_tr = []
scores_te = []
for tr, te in cv.split(X):
    cl = LogisticRegression()
    cl.fit(X[tr], y[tr])
    yhat = cl.predict_proba(X)[:, 1]
    scores_tr.append(log_loss(y[tr], yhat[tr]))
    scores_te.append(log_loss(y[te], yhat[te]))
    
scores_tr, scores_te
Out[27]:
([0.020030998513910356,
  0.02026798586333316,
  0.018731273244436235,
  0.019306333644836884,
  0.01860439768051876],
 [0.04673015055948079,
  0.042830755193124764,
  0.060999083223150466,
  0.06704297424445832,
  0.06559486598950316])
In [28]:
# average training loss
np.mean(scores_tr)
Out[28]:
0.01938819778940708
In [29]:
# average test loss
np.mean(scores_te)
Out[29]:
0.0566395658419435

L2 Regularization¶

  • L2 penalty term: $\frac{1}{2}\lambda \|w\|^2 = \frac{1}{2}\lambda w^Tw$, where $\lambda \geq 0$. (Other norms could also be used instead of the L2 norm.)
  • Regularized cross-entropy: $RCE(w) = CE(w) + \frac{1}{2}\lambda w^Tw$.
    • $\lambda$ is called the regularization coefficient. Larger $\lambda$ means stronger regularization.
    • An alternative formulation is $RCE^*(w) = C \cdot CE(w) + \frac{1}{2} w^Tw$.
  • Gradient vector: $\frac{d}{dw} RCE(w) = \frac{d}{dw} CE(w) + \lambda w$.
  • Hessian matrix: $\left(\frac{d}{dw}\right)^2 RCE(w) = \left(\frac{d}{dw}\right)^2 CE(w) + \lambda I$.
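With these formulas, the Newton-based trainer from Exercise 6 needs only two extra terms. A sketch (the function name, the toy data and the lambda_ values are illustrative assumptions, not from the notebook):

```python
import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

def logreg_fit_l2(X, y, lambda_=1.0, niter=20):
    w = np.zeros(X.shape[1])
    for _ in range(niter):
        yhat = sigmoid(X @ w)
        g = X.T @ (yhat - y) + lambda_ * w                            # regularized gradient
        H = X.T * (yhat * (1 - yhat)) @ X + lambda_ * np.eye(len(w))  # regularized Hessian
        w -= np.linalg.solve(H, g)                                    # Newton step
    return w

# Toy, linearly separable data: without regularization the weight would grow
# without bound; the penalty keeps it finite, and a larger lambda shrinks it further.
X = np.array([[1.0], [2.0], [-1.0], [-2.0]])
y = np.array([1, 1, 0, 0])
print(logreg_fit_l2(X, y, lambda_=0.1), logreg_fit_l2(X, y, lambda_=10.0))
```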

Exercise 4: Plot the training and test error of logistic regression as a function of the regularization coefficient $\lambda$! The relation between $\lambda$ and scikit-learn's $C$ parameter is $C = 1 / \lambda$.

In [37]:
# Model evaluation function.
def evaluate(cl, X, y):    
    cv = KFold(5, shuffle=True, random_state=42)
    scores_tr = []
    scores_te = []
    for tr, te in cv.split(X):
        cl.fit(X[tr], y[tr])
        yhat = cl.predict_proba(X)[:, 1]
        scores_tr.append(log_loss(y[tr], yhat[tr]))
        scores_te.append(log_loss(y[te], yhat[te]))

    return {
        'score_tr': np.mean(scores_tr),
        'score_te': np.mean(scores_te)
    }
In [44]:
# Trying different lambda values.
data = []
for lambd in [1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 0.1, 1, 10, 100]:
    print(lambd)
    cl = LogisticRegression(C=1 / lambd)
    res = evaluate(cl, X, y)
    res['lambda'] = lambd
    data.append(res)
1e-06
1e-05
0.0001
0.001
0.01
0.1
1
10
100
In [46]:
df_res = pd.DataFrame(data).set_index('lambda')
df_res.plot(figsize=(12, 6), grid=True, logx=True)
Out[46]:
<AxesSubplot: xlabel='lambda'>
In [49]:
# optimal lambda value
df_res['score_te'].idxmin()
Out[49]:
0.1

Ridge Regression¶

  • Ridge regression is the L2 regularized variant of linear regression.
  • The error function is $RSSE(w) = \left[\sum_{i=1}^n (\hat{y}_i - y_i)^2\right] + \lambda \left[\frac{1}{2} w^T w\right]$,
    where $\lambda \geq 0$ is the regularization coefficient.
  • Linear regression is contained as a special case, with $\lambda$ set to zero.
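Setting the gradient of $RSSE$ to zero gives the closed-form solution $w = \left(X^TX + \frac{\lambda}{2}I\right)^{-1}X^Ty$ (the factor $\frac{\lambda}{2}$ follows from the $\frac{1}{2}w^Tw$ convention above; scikit-learn's Ridge penalizes $\alpha\|w\|^2$, so its $\alpha$ corresponds to $\lambda/2$ in this notation). A sketch on hypothetical toy data:

```python
import numpy as np

def ridge_fit(X, y, lambda_=1.0):
    # Solve (X^T X + (lambda/2) I) w = X^T y.
    return np.linalg.solve(X.T @ X + (lambda_ / 2) * np.eye(X.shape[1]), X.T @ y)

# Toy data (hypothetical); with lambda=0 this reduces to ordinary least squares.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
print(ridge_fit(X, y, lambda_=0.0))   # exact least-squares fit on this data
print(ridge_fit(X, y, lambda_=10.0))  # weights shrunken toward zero
```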

Exercise 5: Train ridge regression models on the Boston Housing data set that predict the MEDV column. The error metric should be RMSE. Plot the training and test error of ridge regression as a function of the regularization coefficient $\alpha$! Apply 7-fold cross-validation!

In [50]:
# Load the Boston Housing data set.
columns = []
for l in open('../_data/housing_names.txt'):
    if l[:4] == '    ' and l[4].isdigit():
        columns.append(l.split()[1])
df = pd.read_csv('../_data/housing_data.txt', sep='\t', names=columns)        
In [51]:
X = df[df.columns[:-1]].values # input matrix
y = df['MEDV'].values          # target vector
In [52]:
X.shape
Out[52]:
(506, 12)
In [53]:
y.shape
Out[53]:
(506,)
In [59]:
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
In [63]:
# Model evaluation function.
def evaluate(re, X, y):    
    cv = KFold(7, shuffle=True, random_state=42)
    scores_tr = []
    scores_te = []
    for tr, te in cv.split(X):
        re.fit(X[tr], y[tr])
        yhat = re.predict(X)
        scores_tr.append(mean_squared_error(y[tr], yhat[tr])**0.5)
        scores_te.append(mean_squared_error(y[te], yhat[te])**0.5)

    return {
        'score_tr': np.mean(scores_tr),
        'score_te': np.mean(scores_te)
    }
In [64]:
evaluate(Ridge(), X, y)
Out[64]:
{'score_tr': 4.741699999401361, 'score_te': 4.909347897956921}
In [65]:
# Trying different alpha values.
data = []
for alpha in [1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 0.1, 1, 10, 100]:
    print(alpha)
    re = Ridge(alpha=alpha)
    res = evaluate(re, X, y)
    res['alpha'] = alpha
    data.append(res)
1e-06
1e-05
0.0001
0.001
0.01
0.1
1
10
100
In [66]:
df_res = pd.DataFrame(data).set_index('alpha')
df_res.plot(figsize=(12, 6), grid=True, logx=True)
Out[66]:
<AxesSubplot: xlabel='alpha'>

Decision Tree¶

The decision tree is an old but still relevant nonlinear learning algorithm. The leaves of the tree represent distinct subsets of the training data. The other nodes compare a given attribute to a threshold value (e.g. is the body temperature > 37 °C?). The branches starting from the node are associated with the two possible outcomes of the comparison.

In this notebook, we will implement the simplest version of the decision tree, the decision stump, and test it on the Boston Housing data set. Moreover, we will explore the capabilities of scikit-learn's decision tree algorithm.

Exercise 2: Implement the training of the decision stump regressor from scratch, and measure the model's root mean squared error (RMSE) on the full Boston Housing data set!

In [101]:
# Loading the data.
import pandas as pd
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE',
         'DIS', 'RAD', 'TAX', 'PTRATIO', 'LSTAT', 'MEDV']
df = pd.read_csv('../_data/housing_data.txt', delim_whitespace=True, names=names)
X = df[df.columns[:-1]].values # input matrix
y = df['MEDV'].values          # target vector
In [2]:
X.shape
Out[2]:
(506, 12)
In [3]:
y.shape
Out[3]:
(506,)
In [4]:
# Training algorithm.

import numpy as np

n, d = X.shape

sse_min = np.inf
for j in range(d): # iterate over all features
    x = X[:, j] # select j-th column

    # sort x and y by x
    idxs = x.argsort()
    x_sorted = x[idxs]
    y_sorted = y[idxs]

    # find optimal threshold value
    for i in range(n - 1):
        t = (x_sorted[i] + x_sorted[i + 1]) / 2
        y_L = y_sorted[:(i + 1)]
        y_R = y_sorted[(i + 1):]
        yhat_L = y_L.mean() # prediction for left branch
        yhat_R = y_R.mean() # prediction for right branch
        sse_L = ((y_L - yhat_L)**2).sum()
        sse_R = ((y_R - yhat_R)**2).sum()
        sse = sse_L + sse_R # sum of squared errors
        # (instead of re-computing SSE, updating it would be more efficient!)
        
        if sse < sse_min:
            sse_min = sse
            t_opt = t
            j_opt = j
            print(j, t, sse)
0 0.007690000000000001 42714.138495049505
0 0.01001 42654.06214285715
0 0.01306 42607.60066733068
0 0.013354999999999999 42487.76150099801
0 0.014065 42237.84195247638
0 0.014355 42153.999678714856
0 0.014700000000000001 42111.73079365079
0 0.015195 41739.21143434343
0 0.016235 41403.70269905533
0 0.017435 41349.519812763305
0 0.01824 41236.26263646922
0 0.01958 41205.95119132654
0 0.020319999999999998 40890.95242258652
0 0.02182 40635.159448559665
0 0.023425 40556.32548257241
0 0.03393 40547.17303427895
0 0.035235 40380.59639339103
0 0.035809999999999995 40254.02162255965
0 0.03637 40253.174434710054
0 0.037215 40116.550764971194
0 0.038195 40053.84481361776
0 0.04158 40050.55321587302
0 0.060615 40026.656188803965
0 0.06128 39928.78982113821
0 0.061399999999999996 39665.01313689411
0 0.06874 39608.649885609964
0 0.068935 39463.265595959594
0 0.06908 39343.75913705584
0 0.06962 39265.912307865525
0 0.070175 39210.16856426781
0 0.078805 39155.07983894646
0 0.07891000000000001 38989.80508618654
0 0.07996 38928.994316416145
0 0.08193 38717.01742815372
0 0.08232500000000001 38678.05333333334
0 0.08338999999999999 38656.59620356234
0 0.083785 38515.85305258467
0 0.08685499999999999 38423.24137486273
0 0.09085499999999999 38407.847881084235
0 0.091335 38224.95304906361
0 0.092825 38221.28124619482
0 0.09673999999999999 38220.88735692883
0 0.10004 38215.88722212965
0 0.10046 38102.51552353896
0 0.530525 38097.114624419206
0 0.53556 37833.89951690821
0 0.538555 37804.16323365696
0 0.540305 37652.42981832914
0 0.54251 37370.19513926763
0 0.553925 37229.301349661284
0 0.576815 37010.035198692814
0 0.584195 36615.80305075868
0 0.600795 36598.24378066379
0 0.61312 36188.70514973798
0 0.61913 36064.39923206056
0 0.625475 35977.49916167862
0 0.66008 35946.86362004487
0 0.66771 35727.589037974685
0 1.92198 35552.068071910115
0 2.079685 35062.63064576166
0 5.680865 34962.9829047229
0 5.71967 34887.16084843596
0 5.776155 34775.12330894309
0 5.82258 34758.50777564349
0 5.84803 34683.80154100392
0 6.59684 34459.1683803519
0 6.68632 34450.12268540669
2 3.97 34179.313795165566
2 3.97 34079.75626356588
2 3.97 33992.9072965188
2 3.97 33823.920771531106
2 3.97 33446.069910274025
2 3.97 32932.259
2 3.97 32822.44286137958
2 6.2 32525.80000681663
2 6.305 32408.905397365066
2 6.41 32232.526407203906
2 6.41 32069.971986668697
2 6.41 31913.84879658385
2 6.41 31735.28989239707
2 6.66 31633.06994724462
5 6.353 31620.265685618728
5 6.365 31597.217758899675
5 6.3725000000000005 31589.026377765174
5 6.3740000000000006 31580.444654914947
5 6.3765 31550.410197210666
5 6.3785 31385.904983006534
5 6.38 31111.927828065407
5 6.381 30905.164949494945
5 6.3825 30894.875389745866
5 6.3965 30881.240154281644
5 6.4030000000000005 30877.875029335995
5 6.4045000000000005 30643.677037641577
5 6.405 30607.31236292624
5 6.405 30376.976001605777
5 6.4055 30269.188327494005
5 6.4085 30129.227205309136
5 6.413 29979.37480946367
5 6.417 29964.76324786325
5 6.417 29733.06464156396
5 6.423 29683.89051816924
5 6.4254999999999995 29513.365627226638
5 6.428 29506.29679528789
5 6.431 29471.24007837721
5 6.4350000000000005 29115.059845077285
5 6.4365000000000006 28895.02630601379
5 6.4375 28869.883910364148
5 6.4475 28846.588411876586
5 6.4535 28792.809893134086
5 6.455 28628.28351240255
5 6.457000000000001 28576.961106549366
5 6.4585 28456.8392882613
5 6.46 28165.899572877355
5 6.466 27819.337393051967
5 6.4725 27549.313126293997
5 6.4775 27434.100434609827
5 6.4815000000000005 27409.97435883494
5 6.484500000000001 27321.33991896775
5 6.486000000000001 27095.753783150183
5 6.4885 27084.825822994208
5 6.4925 27036.90448457792
5 6.5024999999999995 26913.871297576567
5 6.5105 26872.798036377208
5 6.5120000000000005 26872.49322097378
5 6.5145 26753.396314928657
5 6.5205 26704.4599890533
5 6.5315 26430.367471623744
5 6.539 26411.189509893455
5 6.5425 26192.07816142898
5 6.5455000000000005 25826.708856660527
5 6.652 25815.173483749782
5 6.656000000000001 25512.31084920792
5 6.6655 25259.78012157383
5 6.676 25109.893851662404
5 6.749499999999999 24961.905816346345
5 6.754 24865.09782178218
5 6.782 24328.828513719513
5 6.794 24036.088702747365
5 6.797 23864.15348538103
5 6.824999999999999 23726.463053377523
5 6.8375 23452.510150056238
5 6.918 23443.898181818182
5 6.941 23376.74038861689
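The best split found above is j_opt=5 (the RM feature) with t_opt=6.941, so the stump's full-data RMSE is $\sqrt{sse_{\min}/n} = \sqrt{23376.74/506} \approx 6.80$. The missing prediction step can be sketched as follows (self-contained, on hypothetical toy data):

```python
import numpy as np

def stump_predict(x, t, yhat_L, yhat_R):
    # Left branch if the feature value is <= threshold t, right branch otherwise.
    return np.where(x <= t, yhat_L, yhat_R)

# Hypothetical toy data with an obvious split at t = 2.5.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([10.0, 12.0, 30.0, 32.0])

mask = x <= 2.5
yhat_L, yhat_R = y[mask].mean(), y[~mask].mean()  # branch predictions: 11.0 and 31.0
yhat = stump_predict(x, 2.5, yhat_L, yhat_R)
rmse = ((yhat - y) ** 2).mean() ** 0.5
print(rmse)  # -> 1.0
```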

Exercise 3: Repeat the previous experiment using scikit-learn!

In [5]:
from sklearn.tree import DecisionTreeRegressor

cl = DecisionTreeRegressor(max_depth=1)
cl.fit(X, y)
Out[5]:
DecisionTreeRegressor(max_depth=1)
In [6]:
# Internal parameters of the trained model (.tree_.{feature, threshold, value})
print(cl.tree_.feature)
print(cl.tree_.threshold)
print(cl.tree_.value)
[ 5 -2 -2]
[ 6.94099998 -2.         -2.        ]
[[[22.53280632]]

 [[19.93372093]]

 [[37.23815789]]]

Exercise 4: Apply 3-fold cross-validation!

In [102]:
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

def evaluate(re, X, y):
    cv = KFold(3, random_state=42, shuffle=True)

    scores_tr = []
    scores_te = []
    for tr, te in cv.split(X):
        re.fit(X[tr], y[tr])
        yhat = re.predict(X)
        scores_tr.append(mean_squared_error(y[tr], yhat[tr])**0.5)
        scores_te.append(mean_squared_error(y[te], yhat[te])**0.5)

    return np.mean(scores_tr), np.mean(scores_te)

evaluate(DecisionTreeRegressor(max_depth=1), X, y)
Out[102]:
(6.748971341330223, 7.435013307929704)

Exercise 5: Determine what maximal depth gives the lowest RMSE!

In [106]:
res = []
for max_depth in range(1, 30):
    rmse_tr, rmse_te = evaluate(DecisionTreeRegressor(max_depth=max_depth), X, y)
    res.append({
        'max_depth': max_depth,
        'rmse_tr': rmse_tr,
        'rmse_te': rmse_te
    })

df_res = pd.DataFrame(res).set_index('max_depth')
df_res.plot(figsize=(12, 6), grid=True)
Out[106]:
<AxesSubplot: xlabel='max_depth'>
In [107]:
# Optimal max_depth.
df_res['rmse_te'].idxmin()
Out[107]:
5
In [109]:
df_res.loc[df_res['rmse_te'].idxmin()]
Out[109]:
rmse_tr    2.545337
rmse_te    4.463304
Name: 5, dtype: float64
In [104]:
# Comparison against ridge regression.
from sklearn.linear_model import Ridge
evaluate(Ridge(), X, y)
Out[104]:
(4.736093991350386, 4.8792312907094795)

Exercise 6: Train a decision tree of depth 3 and visualize the trained model!

In [61]:
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

re = DecisionTreeRegressor(max_depth=3)
re.fit(X, y)
plt.figure(figsize=(14, 6))
plot_tree(re, rounded=True, filled=True, feature_names=names)
pass
In [68]:
# Number of parameters in the tree.
(re.tree_.feature >= 0).sum() * 2
Out[68]:
14
In [72]:
# Total size of the data set.
X.shape[0] * (X.shape[1] + 1)
Out[72]:
6578

Decision Trees for Classification¶

  • Decision trees can also be applied to classification problems.
  • The necessary modification is that instead of the sum of squared errors, a different split criterion is applied (e.g. misclassification count, Gini impurity, information gain), and the leaf predictions are changed to class probabilities.
  • Decision trees can handle multiclass problems too.
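For instance, the Gini impurity of a label set is $1 - \sum_k p_k^2$, where $p_k$ is the relative frequency of class $k$; it is zero for a pure node and maximal for a uniform label mix. A minimal sketch (the toy label vectors are assumptions):

```python
import numpy as np

def gini_impurity(y):
    # 1 - sum_k p_k^2, where p_k is the relative frequency of class k.
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - (p ** 2).sum()

print(gini_impurity(np.array([0, 0, 0, 0])))  # pure node -> 0.0
print(gini_impurity(np.array([0, 0, 1, 1])))  # maximally mixed binary node -> 0.5
```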

Exercise 7: Apply a decision tree classifier to the Wisconsin Breast Cancer data set! Use 5-fold cross-validation! The evaluation metric should be the ratio of correct classifications. Determine the maximal depth that gives the highest accuracy! Compare the best decision tree against logistic regression!

In [75]:
# Load the data to DataFrame.
import pandas as pd
names = [
    'Sample_code_number', 'Clump_Thickness', 'Uniformity_of_Cell_Size',
    'Uniformity_of_Cell_Shape', 'Marginal_Adhesion', 'Single_Epithelial_Cell_Size',
    'Bare_Nuclei', 'Bland_Chromatin', 'Normal_Nucleoli', 'Mitoses', 'Class'
]
df = pd.read_csv('../_data/wisconsin_data.txt', sep=',', names=names, na_values='?')
df['Bare_Nuclei'].fillna(0, inplace=True)
df.head()
Out[75]:
Sample_code_number Clump_Thickness Uniformity_of_Cell_Size Uniformity_of_Cell_Shape Marginal_Adhesion Single_Epithelial_Cell_Size Bare_Nuclei Bland_Chromatin Normal_Nucleoli Mitoses Class
0 1000025 5 1 1 1 2 1.0 3 1 1 2
1 1002945 5 4 4 5 7 10.0 3 2 1 2
2 1015425 3 1 1 1 2 2.0 3 1 1 2
3 1016277 6 8 8 1 3 4.0 3 7 1 2
4 1017023 4 1 1 3 2 1.0 3 1 1 2
In [79]:
# Input matrix.
X = df[names[1:-1]].values
X.shape
Out[79]:
(699, 9)
In [84]:
# Target vector.
y = df['Class'].values // 2 - 1
In [95]:
# Model evaluation function.

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

def evaluate(cl, X, y):
    cv = KFold(5, random_state=42, shuffle=True)

    scores_tr = []
    scores_te = []
    for tr, te in cv.split(X):
        cl.fit(X[tr], y[tr])
        yhat = cl.predict(X)
        scores_tr.append(accuracy_score(y[tr], yhat[tr]))
        scores_te.append(accuracy_score(y[te], yhat[te]))

    return np.mean(scores_tr), np.mean(scores_te)
In [96]:
# Trying different max_depth values.

res = []
for max_depth in range(1, 30):
    acc_tr, acc_te = evaluate(DecisionTreeClassifier(max_depth=max_depth), X, y)
    res.append({
        'max_depth': max_depth,
        'acc_tr': acc_tr,
        'acc_te': acc_te
    })

df_res = pd.DataFrame(res).set_index('max_depth')
df_res.plot(figsize=(12, 6), grid=True)
Out[96]:
<AxesSubplot: xlabel='max_depth'>
In [97]:
# Optimal max_depth value.
df_res['acc_te'].idxmax()
Out[97]:
8
In [100]:
df_res.loc[8]
Out[100]:
acc_tr    0.996783
acc_te    0.947061
Name: 8, dtype: float64
In [99]:
# Comparison to logistic regression.

from sklearn.linear_model import LogisticRegression
evaluate(LogisticRegression(), X, y)
Out[99]:
(0.9703136979299771, 0.9627954779033916)

Decision Trees vs. Linear Models¶

Decision trees

  • ...are insensitive to the scale of the input features 😀
  • ...are easier to explain 😀
  • ...can learn complex patterns 😀
  • ...do not handle sparse data efficiently ☹️
  • ...tend to overfit more ☹️

Ensemble Methods¶

  • ensemble estimate: $\bar{y} = \frac{1}{K} \sum_{k=1}^K \hat{y}_k$
  • squared error of the ensemble estimate: $\left(\bar{y} - y\right)^2$
  • decomposition into an accuracy and diversity term: $\left(\bar{y} - y\right)^2 = \frac{1}{K} \sum_{k=1}^K (\hat{y}_k - y)^2 - \frac{1}{K}\sum_{k=1}^K (\hat{y}_k - \bar{y})^2$
  • the ensemble will work well, if the members are diverse and individually accurate
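This decomposition (the "ambiguity decomposition") can be verified numerically; the values below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
y = 3.0                                  # true target
yhats = rng.normal(3.0, 1.0, size=10)    # K = 10 member predictions
ybar = yhats.mean()                      # ensemble estimate

accuracy_term = np.mean((yhats - y)**2)      # average member error
diversity_term = np.mean((yhats - ybar)**2)  # spread around the ensemble
print(np.isclose((ybar - y)**2, accuracy_term - diversity_term))  # True
```

Because the diversity term is subtracted, the ensemble error is never larger than the average member error.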

Random Forest¶

Random forest is a tree-based algorithm for classification and regression. It can be viewed as a "parallel circuit" of decision trees: the output of the forest is the average output of its trees. Some methods for growing an accurate and diverse set of trees are as follows:

  • Bootstrapping: Each tree is trained on a data set that was sampled from the original data with replacement.
  • Feature bagging: Each tree is trained on a random subset of the features.
  • Extreme randomization: For each feature, a random split value is used instead of the optimal one.
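A quick sketch of the bootstrapping step: sampling $n$ row indices with replacement gives each tree roughly $1 - 1/e \approx 63\%$ of the distinct rows (the variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
idxs = rng.integers(0, n, size=n)  # bootstrap sample of row indices
unique_frac = len(np.unique(idxs)) / n
print(unique_frac)                 # close to 1 - 1/e ≈ 0.632
```

The remaining ~37% of the rows are "out-of-bag" for that tree and can be used for validation.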

Exercise 1: Implement a bootstrapping based random forest regressor and evaluate it on the Boston Housing data set using 3-fold cross-validation! The evaluation metric should be RMSE!

In [1]:
# Load the data to DataFrame.
import pandas as pd
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE',
         'DIS', 'RAD', 'TAX', 'PTRATIO', 'LSTAT', 'MEDV']
df = pd.read_csv('../_data/housing_data.txt', delim_whitespace=True, names=names)
df = df.sample(len(df), random_state=42) # shuffle the data
X = df[df.columns[:-1]].values           # input matrix
y = df['MEDV'].values                    # target vector
In [2]:
import numpy as np
from sklearn.tree import DecisionTreeRegressor

class SimpleRandomForestRegressor:
    def __init__(self, n_trees=100, max_depth=None):
        self.n_trees = n_trees
        self.max_depth = max_depth
    
    def fit(self, X, y):
        rs = np.random.RandomState(42)

        self.trees = []
        for k in range(self.n_trees):
            idxs = rs.randint(0, len(y), len(y))                                 # bootstrapping
            re = DecisionTreeRegressor(max_depth=self.max_depth, random_state=k) # create decision tree
            re.fit(X[idxs], y[idxs])                                             # train decision tree
            self.trees.append(re)
    
    def predict(self, X):
        yhat = np.zeros(len(X))
        for re in self.trees:
            yhat += re.predict(X)
        return yhat / len(self.trees)
In [3]:
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

def evaluate(re, X, y):
    cv = KFold(3, random_state=42, shuffle=True)

    scores_tr = []
    scores_te = []
    for tr, te in cv.split(X):
        re.fit(X[tr], y[tr])
        yhat = re.predict(X)
        scores_tr.append(mean_squared_error(y[tr], yhat[tr])**0.5)
        scores_te.append(mean_squared_error(y[te], yhat[te])**0.5)

    return np.mean(scores_tr), np.mean(scores_te)
In [4]:
evaluate(SimpleRandomForestRegressor(), X, y)
Out[4]:
(1.2496535188109161, 3.5344184688013613)
In [42]:
evaluate(DecisionTreeRegressor(random_state=42), X, y)
Out[42]:
(0.0, 4.867211481330203)
In [43]:
from sklearn.linear_model import Ridge
evaluate(Ridge(), X, y)
Out[43]:
(4.710820669783009, 4.927289718141437)
In [46]:
# How the test RMSE depends on the number of trees.
res = []
for n_trees in range(1, 101):
    print(n_trees)
    re = SimpleRandomForestRegressor(n_trees=n_trees)
    _, rmse_te = evaluate(re, X, y)
    res.append({'n_trees': n_trees, 'rmse_te': rmse_te})
In [50]:
df_res = pd.DataFrame(res).set_index('n_trees')
df_res.plot()
Out[50]:
<AxesSubplot: xlabel='n_trees'>

Exercise 2: Implement a feature bagging based random forest regressor and evaluate it on the Boston Housing data set using 3-fold cross-validation!

Our feature bagging strategy: we will bootstrap the columns of X, i.e. sample d columns with replacement.

In [15]:
class SimpleRandomForestRegressor:
    def __init__(self, n_trees=100, max_depth=None, bootstrap=True, feat_bag=False):
        self.n_trees = n_trees
        self.max_depth = max_depth
        self.bootstrap = bootstrap
        self.feat_bag = feat_bag
    
    def fit(self, X, y):
        rs = np.random.RandomState(42)
        n, d = X.shape

        self.trees = []
        for k in range(self.n_trees):
            row_idxs = np.arange(n)
            if self.bootstrap:
                row_idxs = rs.randint(0, n, n) # bootstrapping
            
            col_idxs = np.arange(d)
            if self.feat_bag:
                col_idxs = rs.randint(0, d, d) # feature bagging
            
            tree = DecisionTreeRegressor(max_depth=self.max_depth, random_state=k) # create decision tree
            tree.fit(X[row_idxs][:, col_idxs], y[row_idxs]) # train decision tree
            self.trees.append((tree, col_idxs))
    
    def predict(self, X):
        yhat = np.zeros(len(X))
        for tree, col_idxs in self.trees:
            yhat += tree.predict(X[:, col_idxs])
        return yhat / len(self.trees)
In [17]:
evaluate(SimpleRandomForestRegressor(bootstrap=False, feat_bag=False), X, y)
Out[17]:
(3.221869811461758e-14, 4.563463998771837)
In [18]:
evaluate(SimpleRandomForestRegressor(bootstrap=True, feat_bag=False), X, y)
Out[18]:
(1.2496535188109161, 3.5344184688013613)
In [19]:
evaluate(SimpleRandomForestRegressor(bootstrap=False, feat_bag=True), X, y)
Out[19]:
(0.018271831222768983, 3.746377378818943)
In [20]:
evaluate(SimpleRandomForestRegressor(bootstrap=True, feat_bag=True), X, y)
Out[20]:
(1.3106116530969623, 3.65460581683147)

Exercise 3: Repeat the previous experiments using scikit-learn's RandomForestRegressor class!

In [29]:
from sklearn.ensemble import RandomForestRegressor

# no bootstrapping, no feature bagging
evaluate(RandomForestRegressor(bootstrap=False, max_features=1.0, random_state=42), X, y)
Out[29]:
(3.221869811461758e-14, 4.534326812091633)
In [28]:
# bootstrapping, no feature bagging
evaluate(RandomForestRegressor(bootstrap=True, max_features=1.0, random_state=42), X, y)
Out[28]:
(1.249721947092092, 3.578281219483339)
In [26]:
# no bootstrapping, feature bagging (using the 50% strategy)
evaluate(RandomForestRegressor(bootstrap=False, max_features=0.5, random_state=42), X, y)
Out[26]:
(0.0008610326361854645, 3.432009311114937)
In [27]:
# bootstrapping and feature bagging (using the 50% strategy)
evaluate(RandomForestRegressor(bootstrap=True, max_features=0.5, random_state=42), X, y)
Out[27]:
(1.2368681562001127, 3.464869293320867)

Exercise 4: Plot the training and the test RMSE as a function of the number of trees!

In [32]:
res = []
for n_trees in range(1, 51):
    print(n_trees, end=' ')
    # no bootstrapping, feature bagging
    re = RandomForestRegressor(n_estimators=n_trees, bootstrap=False, max_features=0.5, random_state=42)
    rmse_tr, rmse_te = evaluate(re, X, y)
    res.append({'n_trees': n_trees, 'rmse_tr': rmse_tr, 'rmse_te': rmse_te})
In [39]:
df_res = pd.DataFrame(res).set_index('n_trees')
df_res.plot(figsize=(12, 6), grid=True)
Out[39]:
<AxesSubplot: xlabel='n_trees'>
In [40]:
res = []
for n_trees in range(1, 51):
    print(n_trees, end=' ')
    # bootstrapping, no feature bagging
    re = RandomForestRegressor(n_estimators=n_trees, bootstrap=True, max_features=1.0, random_state=42)
    rmse_tr, rmse_te = evaluate(re, X, y)
    res.append({'n_trees': n_trees, 'rmse_tr': rmse_tr, 'rmse_te': rmse_te})
    
df_res = pd.DataFrame(res).set_index('n_trees')
df_res.plot(figsize=(12, 6), grid=True)    
Out[40]:
<AxesSubplot: xlabel='n_trees'>

Exercise 5: Repeat the previous experiment using the "extreme randomization" strategy!

In [45]:
from sklearn.ensemble import ExtraTreesRegressor
evaluate(ExtraTreesRegressor(random_state=42), X, y)
Out[45]:
(3.221869811461758e-14, 3.5508313733547276)
In [46]:
res = []
for n_trees in range(1, 51):
    print(n_trees, end=' ')
    re = ExtraTreesRegressor(n_estimators=n_trees, random_state=42)
    rmse_tr, rmse_te = evaluate(re, X, y)
    res.append({'n_trees': n_trees, 'rmse_tr': rmse_tr, 'rmse_te': rmse_te})
    
df_res = pd.DataFrame(res).set_index('n_trees')
df_res.plot(figsize=(12, 6), grid=True)    
Out[46]:
<AxesSubplot: xlabel='n_trees'>
In [55]:
from sklearn.model_selection import ShuffleSplit
tr, te = next(ShuffleSplit(test_size=0.3, random_state=42).split(X))

res = []
re = ExtraTreesRegressor(n_estimators=1, random_state=42)
for n_trees in range(1, 201):
    print(n_trees, end=' ')
    if n_trees > 1:
        re.warm_start = True
        re.n_estimators += 1
    re.fit(X[tr], y[tr])
    yhat = re.predict(X)
    rmse_tr = mean_squared_error(y[tr], yhat[tr])**0.5
    rmse_te = mean_squared_error(y[te], yhat[te])**0.5
    res.append({'n_trees': n_trees, 'rmse_tr': rmse_tr, 'rmse_te': rmse_te})
    
df_res = pd.DataFrame(res).set_index('n_trees')
df_res.plot(figsize=(12, 6), grid=True)    
Out[55]:
<AxesSubplot: xlabel='n_trees'>

Gradient Boosting¶

  • Gradient boosting is another ensemble technique for classification and regression. It can be viewed as a "series circuit" of base learners.
  • The idea of gradient boosting originates with Leo Breiman's observation that boosting can be interpreted as gradient descent; Jerome Friedman turned it into an explicit algorithm (1999).
  • The diversity of the base learners is achieved by training them on different targets.
  • The base learners are regressors, both for classification and regression.
  • Usually, the base learners are decision tree regressors, but in theory they could be any regression algorithm.
  • Gradient Boosted Decision Trees (or Gradient Boosting Machine) is a "Swiss army knife" method in machine learning. It is invariant to the scale of the feature values and performs well on a wide variety of problems.

Pseudo Code of Training (w/o Learning Rate)¶
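A standard statement of the generic training loop (Friedman's formulation), using the notation of the following sections:

  • initialize with the best constant model: $F_0(x) = \arg\min_{\gamma} \sum_{i=1}^n L(y_i, \gamma)$
  • for $m = 1, \dots, M$:
    • compute the pseudo-residuals: $r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F = F_{m-1}}$
    • fit a base learner $h_m$ to the pairs $(x_i, r_{im})$, $i = 1, \dots, n$
    • find the step size: $\gamma_m = \arg\min_{\gamma} \sum_{i=1}^n L\left(y_i, F_{m-1}(x_i) + \gamma h_m(x_i)\right)$
    • update the model: $F_m(x) = F_{m-1}(x) + \gamma_m h_m(x)$
  • the final model is $F_M(x)$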

Learning Rate¶

  • instead of step size $\gamma_m$, we use $\eta \cdot \gamma_m$, where $\eta \in (0, 1]$
  • $\eta<1$ implements the "slow cooking" idea, and in practice leads to better ensembles than $\eta=1$

Special Case: Gradient Boosting for Regression¶

  • the loss function is the squared loss: $L(y, F(x)) = \frac{1}{2} \left(y - F(x)\right)^2$
  • the initial model is the average target: $F_0(x) = \frac{1}{n} \sum_{i=1}^n y_i$
  • pseudo-residuals: $r_{im} = y_i - F_{m-1}(x_i)$
  • optimal multiplier: $\gamma_m = \left[\sum_{i=1}^n h_m(x_i)r_{im}\right] / \left[\sum_{i=1}^n \left(h_m(x_i)\right)^2\right]$
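For the squared loss, this multiplier is exactly the least-squares step size; a quick numerical check with synthetic vectors (`r` and `h` are arbitrary stand-ins for the pseudo-residuals and the base learner's predictions):

```python
import numpy as np

rng = np.random.default_rng(42)
r = rng.normal(size=100)   # pseudo-residuals r_im
h = rng.normal(size=100)   # base learner predictions h_m(x_i)

gamma = (h @ r) / (h @ h)  # closed-form optimal multiplier

# gamma minimizes the squared loss sum_i (r_i - g * h_i)^2 over g
grid = np.linspace(gamma - 1, gamma + 1, 201)
losses = [np.sum((r - g * h)**2) for g in grid]
print(np.isclose(grid[np.argmin(losses)], gamma))  # True
```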

Exercise 1: Implement a tree based gradient boosting regressor and evaluate it on the Boston Housing data set using 3-fold cross-validation! Use a maximal tree depth of 3!

In [1]:
# Load the Boston Housing data set.
import pandas as pd
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE',
         'DIS', 'RAD', 'TAX', 'PTRATIO', 'LSTAT', 'MEDV']
df = pd.read_csv('../_data/housing_data.txt', delim_whitespace=True, names=names)
df = df.sample(len(df), random_state=42) # data shuffling
X = df.values[:, :-1] # input matrix
y = df['MEDV'].values # target vector
In [3]:
X.shape
Out[3]:
(506, 12)
In [4]:
y.shape
Out[4]:
(506,)
In [30]:
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

def evaluate(re, X, y):
    cv = KFold(3, random_state=42, shuffle=True)

    scores_tr = []
    scores_te = []
    for tr, te in cv.split(X):
        re.fit(X[tr], y[tr])
        yhat = re.predict(X)
        scores_tr.append(mean_squared_error(y[tr], yhat[tr])**0.5)
        scores_te.append(mean_squared_error(y[te], yhat[te])**0.5)

    return np.mean(scores_tr), np.mean(scores_te)
In [36]:
import numpy as np
from sklearn.tree import DecisionTreeRegressor

class SimpleGradientBoostingRegressor:
    def __init__(self, n_trees=100, eta=0.1, max_depth=3):
        self.n_trees = n_trees
        self.eta = eta
        self.max_depth = max_depth
    
    def fit(self, X, y):
        self.F0 = np.mean(y) # best constant model
        r = y - self.F0 # pseudo-residuals

        self.trees = []
        for m in range(self.n_trees):
        
            tree = DecisionTreeRegressor(max_depth=self.max_depth, random_state=m)
            tree.fit(X, r)         # fit base learner
            rhat = tree.predict(X) # base learner's prediction
            gamma = (rhat @ r) / (rhat @ rhat) # optimal step size
            w = self.eta * gamma
            self.trees.append((w, tree)) # save tree and w
            r -= w * rhat        
    
    def predict(self, X):
        yhat = np.ones(len(X)) * self.F0
        for w, tree in self.trees:
            yhat += w * tree.predict(X)
        return yhat
In [39]:
evaluate(SimpleGradientBoostingRegressor(), X, y)
Out[39]:
(1.2716614748909418, 3.363000134587033)

Exercise 2: Repeat the previous experiment using scikit-learn!

In [44]:
from sklearn.ensemble import GradientBoostingRegressor

evaluate(GradientBoostingRegressor(random_state=42), X, y)
Out[44]:
(1.2716614748909418, 3.36820868174328)

Exercise 3: Which tree depth gives the most accurate ensemble?

In [47]:
res = []
for max_depth in range(1, 13):
    print(max_depth, end=' ')
    re = GradientBoostingRegressor(max_depth=max_depth, random_state=42)
    rmse_tr, rmse_te = evaluate(re, X, y)
    res.append({'max_depth': max_depth, 'rmse_tr': rmse_tr, 'rmse_te': rmse_te})
    
df_res = pd.DataFrame(res).set_index('max_depth')
df_res.plot(figsize=(12, 6), grid=True)    
Out[47]:
<AxesSubplot: xlabel='max_depth'>

Exercise 3/B: How do the training and test RMSE change with the number of trees? (Use a simple train-test split for this experiment!)

In [49]:
from sklearn.model_selection import ShuffleSplit
tr, te = next(ShuffleSplit(test_size=0.3, random_state=42).split(X))

res = []
re = GradientBoostingRegressor(n_estimators=1, random_state=42)
for n_trees in range(1, 401):
    print(n_trees, end=' ')
    if n_trees > 1:
        re.warm_start = True
        re.n_estimators += 1
    re.fit(X[tr], y[tr])
    yhat = re.predict(X)
    rmse_tr = mean_squared_error(y[tr], yhat[tr])**0.5
    rmse_te = mean_squared_error(y[te], yhat[te])**0.5
    res.append({'n_trees': n_trees, 'rmse_tr': rmse_tr, 'rmse_te': rmse_te})
    
df_res = pd.DataFrame(res).set_index('n_trees')
df_res.plot(figsize=(12, 6), grid=True)
Out[49]:
<AxesSubplot: xlabel='n_trees'>
In [50]:
# What happens if we use deeper trees?
from sklearn.model_selection import ShuffleSplit
tr, te = next(ShuffleSplit(test_size=0.3, random_state=42).split(X))

res = []
re = GradientBoostingRegressor(n_estimators=1, max_depth=6, random_state=42)
for n_trees in range(1, 401):
    print(n_trees, end=' ')
    if n_trees > 1:
        re.warm_start = True
        re.n_estimators += 1
    re.fit(X[tr], y[tr])
    yhat = re.predict(X)
    rmse_tr = mean_squared_error(y[tr], yhat[tr])**0.5
    rmse_te = mean_squared_error(y[te], yhat[te])**0.5
    res.append({'n_trees': n_trees, 'rmse_tr': rmse_tr, 'rmse_te': rmse_te})
    
df_res = pd.DataFrame(res).set_index('n_trees')
df_res.plot(figsize=(12, 6), grid=True)    
Out[50]:
<AxesSubplot: xlabel='n_trees'>

Exercise 4: Apply a random forest and a gradient boosting classifier on the Wisconsin Breast Cancer data set! Use stratified 10-fold cross-validation! The evaluation metric should be the ratio of correct classifications. For both ensemble methods, determine the maximal tree depth that gives the highest accuracy!

In [1]:
# Load the Wisconsin Breast Cancer data set.
import pandas as pd
names = [
    'Sample_code_number', 'Clump_Thickness', 'Uniformity_of_Cell_Size',
    'Uniformity_of_Cell_Shape', 'Marginal_Adhesion', 'Single_Epithelial_Cell_Size',
    'Bare_Nuclei', 'Bland_Chromatin', 'Normal_Nucleoli', 'Mitoses', 'Class'
]
df = pd.read_csv('../_data/wisconsin_data.txt', sep=',', names=names, na_values='?')
df = df.sample(len(df), random_state=42) # data shuffling
df['Bare_Nuclei'].fillna(0, inplace=True)
X = df[df.columns[1: -1]].values
y = (df['Class'].values / 2 - 1).astype('int')
In [3]:
X.shape, y.shape
Out[3]:
((699, 9), (699,))
In [21]:
# Evaluation function.

import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

def evaluate(cl, X, y):
    cv = StratifiedKFold(10, shuffle=True, random_state=42)
    scores = []
    for tr, te in cv.split(X, y):
        cl.fit(X[tr], y[tr])
        yhat = cl.predict(X) # label prediction
        score = accuracy_score(y[te], yhat[te])
        scores.append(score)
    return np.mean(scores)
In [23]:
# Dummy classifier's accuracy.
from sklearn.dummy import DummyClassifier
evaluate(DummyClassifier(), X, y)
Out[23]:
0.6552173913043479
In [26]:
# Logistic regression's accuracy.
from sklearn.linear_model import LogisticRegression
evaluate(LogisticRegression(), X, y)
Out[26]:
0.9656314699792962
In [31]:
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

evaluate(GradientBoostingClassifier(random_state=42), X, y)
Out[31]:
0.9527536231884058
In [32]:
evaluate(RandomForestClassifier(random_state=42), X, y)
Out[32]:
0.9698964803312631
In [42]:
# gradient boosting, random forest, different max_depth values, ...
res = []
for max_depth in list(range(1, 11)):
    print(max_depth, end=' ')
    gb = GradientBoostingClassifier(max_depth=max_depth, random_state=42)
    rf = RandomForestClassifier(max_depth=max_depth, random_state=42)
    res.append({'max_depth': max_depth, 'RF_acc': evaluate(rf, X, y), 'GB_acc': evaluate(gb, X, y)})
In [46]:
df_res = pd.DataFrame(res).set_index('max_depth')
df_res.plot(figsize=(12, 6), grid=True)
Out[46]:
<AxesSubplot: xlabel='max_depth'>

Gradient Boosting on Steroids¶

  • XGBoost and LightGBM are highly efficient and flexible implementations of gradient boosting.
  • XGBoost started as a research project by Tianqi Chen (in 2014).
  • LightGBM was introduced by Microsoft Research (in 2016).

Exercise 5: Compare XGBoost, LightGBM and scikit-learn's GradientBoostingClassifier on the Wisconsin Breast Cancer problem, in terms of speed and accuracy!

In [48]:
import xgboost
xgboost.__version__
Out[48]:
'1.7.3'
In [50]:
import lightgbm
lightgbm.__version__
Out[50]:
'3.3.5'
In [72]:
import time
from xgboost import XGBClassifier

t0 = time.time()
print('acc:', evaluate(XGBClassifier(n_estimators=100, max_depth=3), X, y))
print('time:', time.time() - t0)
acc: 0.9585093167701864
time: 0.4498872756958008
In [75]:
from lightgbm import LGBMClassifier
t0 = time.time()
print('acc:', evaluate(LGBMClassifier(n_estimators=100, num_leaves=8), X, y))
print('time:', time.time() - t0)
acc: 0.9585300207039337
time: 0.28879475593566895
In [76]:
t0 = time.time()
print('acc:', evaluate(GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=42), X, y))
print('time:', time.time() - t0)
acc: 0.9527536231884058
time: 1.1589224338531494
In [83]:
# Since version 0.21, scikit-learn includes a histogram based
# gradient boosting algorithm that was inspired by LightGBM.
from sklearn.ensemble import HistGradientBoostingClassifier
t0 = time.time()
print('acc:', evaluate(HistGradientBoostingClassifier(max_iter=100, max_leaf_nodes=8, random_state=42), X, y))
print('time:', time.time() - t0)
acc: 0.9585093167701864
time: 1.296454668045044

Artificial Neural Networks¶

  • Artificial neural networks are computing systems vaguely inspired by the human brain.
  • The field was opened by McCulloch and Pitts (1943), who created a computational model of neural networks.
  • The network is built of neurons that are interconnected like a web.
  • Each connection, like the synapses in a brain, can transmit a signal (=real number) to other neurons.
  • Main types of neural networks:
    • multilayer perceptron (old school but still useful)
    • autoencoder (for dimension reduction and visualization)
    • convolutional network (originally developed for image classification)
    • recurrent network (originally developed for text classification; example: LSTM)
    • transformer (originally developed for machine translation)
    • competitive network (example: GAN)
    • ...

Theory of the Multilayer Perceptron in a Nutshell¶

  • input: $x \in \mathbb{R}^{d \times 1}$
  • hidden layer: $h = \sigma(W^T x)$, where $W \in \mathbb{R}^{d \times K}$ and $\sigma$ is the logistic sigmoid function
  • model output: $\hat{y} = \sigma(v^T h)$, where $v \in \mathbb{R}^{K \times 1}$
  • the parameters of the model are the matrix $W$ (hidden weights) and the vector $v$ (output weights)

  • objective function: $CE(W, v) = \sum_{i=1}^n \left( -y_i\log(\hat{y}_i) - (1 - y_i)\log(1 - \hat{y}_i) \right)$
  • derivative by $v$: $\frac{d}{dv} CE(W, v) = \sum_{i=1}^n(\hat{y}_i - y_i) h_i$
  • derivative by $W$: $\frac{d}{dW} CE(W, v) = \sum_{i=1}^n x_i \varepsilon_i^T$, where $\varepsilon_i = (\hat{y}_i - y_i) v \odot h_i \odot(1 - h_i)$ is the backpropagated error
  • the approximate minimization of $CE$ can be done e.g. by stochastic gradient descent
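The derivative with respect to $v$ can be sanity-checked against finite differences; a minimal sketch for a single training example (all sizes and variable names are illustrative):

```python
import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

rng = np.random.default_rng(0)
d, K = 5, 3
x = rng.normal(size=d)                  # one input vector
y = 1.0                                 # its binary label
W = rng.normal(scale=0.1, size=(d, K))  # hidden weights
v = rng.normal(scale=0.1, size=K)       # output weights

def ce(v):
    h = sigmoid(W.T @ x)
    yhat = sigmoid(v @ h)
    return -y * np.log(yhat) - (1 - y) * np.log(1 - yhat)

# analytic gradient from the formula above: (yhat - y) * h
h = sigmoid(W.T @ x)
grad_v = (sigmoid(v @ h) - y) * h

# central finite differences
eps = 1e-6
num = np.array([(ce(v + eps * np.eye(K)[j]) - ce(v - eps * np.eye(K)[j])) / (2 * eps)
                for j in range(K)])
print(np.allclose(grad_v, num, atol=1e-8))  # True
```

The same finite-difference trick applies to the $W$ gradient and is a standard way to debug backpropagation code.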

The Phishing Websites Problem¶

The Phishing Websites data set contains attributes of websites. The target attribute is the last column; it specifies whether a site is legitimate (-1) or phishing (+1). Our goal is to build an artificial neural network that predicts the value of the target attribute.

Exercise 1: Load the Phishing Websites data set to a data frame. Prepare the input matrix and the target vector.

In [1]:
# Load data.
import pandas as pd
from urllib.request import urlopen

url = 'https://archive.ics.uci.edu/'\
      'ml/machine-learning-databases/00327/Training%20Dataset.arff'
lines = urlopen(url).read().decode('utf-8').split('\r\n')
names = [l.split()[1] for l in lines if l.startswith('@att')]
skiprows = lines.index('@data') + 1
df = pd.read_csv(url, names=names, skiprows=skiprows)
In [2]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11055 entries, 0 to 11054
Data columns (total 31 columns):
 #   Column                       Non-Null Count  Dtype
---  ------                       --------------  -----
 0   having_IP_Address            11055 non-null  int64
 1   URL_Length                   11055 non-null  int64
 2   Shortining_Service           11055 non-null  int64
 3   having_At_Symbol             11055 non-null  int64
 4   double_slash_redirecting     11055 non-null  int64
 5   Prefix_Suffix                11055 non-null  int64
 6   having_Sub_Domain            11055 non-null  int64
 7   SSLfinal_State               11055 non-null  int64
 8   Domain_registeration_length  11055 non-null  int64
 9   Favicon                      11055 non-null  int64
 10  port                         11055 non-null  int64
 11  HTTPS_token                  11055 non-null  int64
 12  Request_URL                  11055 non-null  int64
 13  URL_of_Anchor                11055 non-null  int64
 14  Links_in_tags                11055 non-null  int64
 15  SFH                          11055 non-null  int64
 16  Submitting_to_email          11055 non-null  int64
 17  Abnormal_URL                 11055 non-null  int64
 18  Redirect                     11055 non-null  int64
 19  on_mouseover                 11055 non-null  int64
 20  RightClick                   11055 non-null  int64
 21  popUpWidnow                  11055 non-null  int64
 22  Iframe                       11055 non-null  int64
 23  age_of_domain                11055 non-null  int64
 24  DNSRecord                    11055 non-null  int64
 25  web_traffic                  11055 non-null  int64
 26  Page_Rank                    11055 non-null  int64
 27  Google_Index                 11055 non-null  int64
 28  Links_pointing_to_page       11055 non-null  int64
 29  Statistical_report           11055 non-null  int64
 30  Result                       11055 non-null  int64
dtypes: int64(31)
memory usage: 2.6 MB
In [3]:
df.describe().T
Out[3]:
count mean std min 25% 50% 75% max
having_IP_Address 11055.0 0.313795 0.949534 -1.0 -1.0 1.0 1.0 1.0
URL_Length 11055.0 -0.633198 0.766095 -1.0 -1.0 -1.0 -1.0 1.0
Shortining_Service 11055.0 0.738761 0.673998 -1.0 1.0 1.0 1.0 1.0
having_At_Symbol 11055.0 0.700588 0.713598 -1.0 1.0 1.0 1.0 1.0
double_slash_redirecting 11055.0 0.741474 0.671011 -1.0 1.0 1.0 1.0 1.0
Prefix_Suffix 11055.0 -0.734962 0.678139 -1.0 -1.0 -1.0 -1.0 1.0
having_Sub_Domain 11055.0 0.063953 0.817518 -1.0 -1.0 0.0 1.0 1.0
SSLfinal_State 11055.0 0.250927 0.911892 -1.0 -1.0 1.0 1.0 1.0
Domain_registeration_length 11055.0 -0.336771 0.941629 -1.0 -1.0 -1.0 1.0 1.0
Favicon 11055.0 0.628584 0.777777 -1.0 1.0 1.0 1.0 1.0
port 11055.0 0.728268 0.685324 -1.0 1.0 1.0 1.0 1.0
HTTPS_token 11055.0 0.675079 0.737779 -1.0 1.0 1.0 1.0 1.0
Request_URL 11055.0 0.186793 0.982444 -1.0 -1.0 1.0 1.0 1.0
URL_of_Anchor 11055.0 -0.076526 0.715138 -1.0 -1.0 0.0 0.0 1.0
Links_in_tags 11055.0 -0.118137 0.763973 -1.0 -1.0 0.0 0.0 1.0
SFH 11055.0 -0.595749 0.759143 -1.0 -1.0 -1.0 -1.0 1.0
Submitting_to_email 11055.0 0.635640 0.772021 -1.0 1.0 1.0 1.0 1.0
Abnormal_URL 11055.0 0.705292 0.708949 -1.0 1.0 1.0 1.0 1.0
Redirect 11055.0 0.115694 0.319872 0.0 0.0 0.0 0.0 1.0
on_mouseover 11055.0 0.762099 0.647490 -1.0 1.0 1.0 1.0 1.0
RightClick 11055.0 0.913885 0.405991 -1.0 1.0 1.0 1.0 1.0
popUpWidnow 11055.0 0.613388 0.789818 -1.0 1.0 1.0 1.0 1.0
Iframe 11055.0 0.816915 0.576784 -1.0 1.0 1.0 1.0 1.0
age_of_domain 11055.0 0.061239 0.998168 -1.0 -1.0 1.0 1.0 1.0
DNSRecord 11055.0 0.377114 0.926209 -1.0 -1.0 1.0 1.0 1.0
web_traffic 11055.0 0.287291 0.827733 -1.0 0.0 1.0 1.0 1.0
Page_Rank 11055.0 -0.483673 0.875289 -1.0 -1.0 -1.0 1.0 1.0
Google_Index 11055.0 0.721574 0.692369 -1.0 1.0 1.0 1.0 1.0
Links_pointing_to_page 11055.0 0.344007 0.569944 -1.0 0.0 0.0 1.0 1.0
Statistical_report 11055.0 0.719584 0.694437 -1.0 1.0 1.0 1.0 1.0
Result 11055.0 0.113885 0.993539 -1.0 -1.0 1.0 1.0 1.0
In [4]:
X = df[df.columns[:-1]].values      # input matrix
y = ((df['Result'] + 1) / 2).values # target vector
X.shape, y.shape, X.sum(), y.mean()
Out[4]:
((11055, 30), (11055,), 100854, 0.5569425599276345)

Exercise 2: Implement a multilayer perceptron classifier from scratch! Use stochastic gradient descent for training. Evaluate the model on the Phishing Websites data set using a 70%-30% train-test split!

In [9]:
import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

class SimpleMLPClassifier:
    def __init__(self, n_hidden=32, init_range=0.1, n_epochs=5, learning_rate=0.01, random_state=42):
        self.n_hidden = n_hidden
        self.init_range = init_range
        self.n_epochs = n_epochs
        self.learning_rate = learning_rate
        self.random_state = random_state
      
    def _forward(self, x_i):
        h_i = sigmoid(self.W.T @ x_i)  # hidden activations
        yhat_i = sigmoid(self.v @ h_i) # output activation
        return h_i, yhat_i
    
    def fit(self, X, y):
        # model initialization
        n, d = X.shape
        rs = np.random.RandomState(self.random_state)
        self.W = rs.uniform(-self.init_range, self.init_range, (d, self.n_hidden)) # hidden weights
        self.v = rs.uniform(-self.init_range, self.init_range, self.n_hidden)      # output weights

        for e in range(self.n_epochs):
            for i in range(n):
                h_i, yhat_i = self._forward(X[i]) # propagate the signal forward
                grad_v = (yhat_i - y[i]) * h_i    # derivative w.r.t. v
                eps_i = (yhat_i - y[i]) * self.v * h_i * (1 - h_i) # backpropagated error
                grad_W = np.outer(X[i], eps_i)    # derivative w.r.t. W

                # update model parameters
                self.v -= self.learning_rate * grad_v
                self.W -= self.learning_rate * grad_W
                
        return self
    
    def predict(self, X):
        return (self.predict_proba(X)[:, 1] > 0.5).astype('int')
    
    def predict_proba(self, X):
        n = X.shape[0]
        Yhat = np.zeros((n, 2))
        for i in range(n):
            _, yhat_i = self._forward(X[i])
            Yhat[i, 1] = yhat_i
        Yhat[:, 0] = 1 - Yhat[:, 1]
        return Yhat
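The backpropagation formulas in fit() are easy to get wrong, so before training it is worth verifying them. The following sanity check (not part of the exercise; the tiny network and dimensions are arbitrary) compares the analytic gradients against central finite differences of the log loss:

```python
import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

# tiny random network (dimensions chosen arbitrarily for the check)
rng = np.random.RandomState(0)
d, k = 4, 3
x = rng.randn(d)
y_true = 1.0
W = rng.randn(d, k) * 0.1  # hidden weights
v = rng.randn(k) * 0.1     # output weights

def log_loss(W, v):
    h = sigmoid(W.T @ x)
    yhat = sigmoid(v @ h)
    return -(y_true * np.log(yhat) + (1 - y_true) * np.log(1 - yhat))

# analytic gradients (same formulas as in SimpleMLPClassifier.fit)
h = sigmoid(W.T @ x)
yhat = sigmoid(v @ h)
grad_v = (yhat - y_true) * h
eps = (yhat - y_true) * v * h * (1 - h)
grad_W = np.outer(x, eps)

# central finite differences for v[0] and W[0, 0]
delta = 1e-6
dv = delta * np.eye(k)[0]
num_v = (log_loss(W, v + dv) - log_loss(W, v - dv)) / (2 * delta)
dW = np.zeros((d, k)); dW[0, 0] = delta
num_W = (log_loss(W + dW, v) - log_loss(W - dW, v)) / (2 * delta)
print(abs(num_v - grad_v[0]), abs(num_W - grad_W[0, 0]))  # both should be tiny
```

If the analytic and numeric gradients agree to many decimal places, the update rules in the training loop are correct.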
In [24]:
from sklearn.model_selection import ShuffleSplit
from sklearn.metrics import accuracy_score

tr, te = next(ShuffleSplit(test_size=0.3, random_state=42).split(X))

for n_epochs in range(6):
    cl = SimpleMLPClassifier(n_epochs=n_epochs)
    cl.fit(X[tr], y[tr])
    print(n_epochs, accuracy_score(cl.predict(X[te]), y[te]))
0 0.43050949653301174
1 0.9201085318058486
2 0.9182996683750377
3 0.9173952366596322
4 0.9179981911365692
5 0.9176967138981007

Exercise 3: Compare the previous solution against scikit-learn's MLPClassifier!

In [43]:
from sklearn.neural_network import MLPClassifier

for n_epochs in range(1, 4):
    cl = MLPClassifier(
        hidden_layer_sizes=(32,), activation='logistic', solver='sgd', batch_size=1,
        learning_rate_init=0.01, max_iter=n_epochs, momentum=0, alpha=0, random_state=42
    ) # (the intercept terms cannot be switched off)
    
    cl.fit(X[tr], y[tr])
    print(n_epochs, accuracy_score(cl.predict(X[te]), y[te]))
/home/compsci/.local/lib/python3.8/site-packages/sklearn/neural_network/_multilayer_perceptron.py:684: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (1) reached and the optimization hasn't converged yet.
  warnings.warn(
1 0.9186011456135061
/home/compsci/.local/lib/python3.8/site-packages/sklearn/neural_network/_multilayer_perceptron.py:684: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (2) reached and the optimization hasn't converged yet.
  warnings.warn(
2 0.9198070545673802
3 0.9207114862827857
/home/compsci/.local/lib/python3.8/site-packages/sklearn/neural_network/_multilayer_perceptron.py:684: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (3) reached and the optimization hasn't converged yet.
  warnings.warn(

Exercise 4: Optimize the meta-parameters of the neural network!

In [47]:
def evaluate(cl, X, y):
    tr, te = next(ShuffleSplit(test_size=0.3, random_state=42).split(X))
    cl.fit(X[tr], y[tr])
    return accuracy_score(cl.predict(X[te]), y[te])
In [48]:
# default settings
evaluate(MLPClassifier(random_state=42), X, y)
/home/compsci/.local/lib/python3.8/site-packages/sklearn/neural_network/_multilayer_perceptron.py:684: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet.
  warnings.warn(
Out[48]:
0.9644256858607175
In [81]:
# some hand optimization
cl = MLPClassifier(learning_rate_init=0.01, max_iter=50, batch_size=64, random_state=42)
evaluate(cl, X, y)
/home/compsci/.local/lib/python3.8/site-packages/sklearn/neural_network/_multilayer_perceptron.py:684: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (50) reached and the optimization hasn't converged yet.
  warnings.warn(
Out[81]:
0.9692493216762135
In [97]:
from sklearn.model_selection import GridSearchCV

cl = MLPClassifier()
param_grid = {
    'learning_rate_init': [0.01, 0.02, 0.05],
    'max_iter': [50, 75],
    'batch_size': [48, 64, 72],
    'random_state': [42],
}

cv = ShuffleSplit(n_splits=1, test_size=0.3, random_state=42)
gs = GridSearchCV(cl, param_grid, cv=cv, verbose=2)
gs.fit(X, y)
Fitting 1 folds for each of 18 candidates, totalling 18 fits
/home/compsci/.local/lib/python3.8/site-packages/sklearn/neural_network/_multilayer_perceptron.py:684: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (50) reached and the optimization hasn't converged yet.
  warnings.warn(
[CV] END batch_size=48, learning_rate_init=0.01, max_iter=50, random_state=42; total time=  14.3s
[CV] END batch_size=48, learning_rate_init=0.01, max_iter=75, random_state=42; total time=  19.6s
/home/compsci/.local/lib/python3.8/site-packages/sklearn/neural_network/_multilayer_perceptron.py:684: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (50) reached and the optimization hasn't converged yet.
  warnings.warn(
[CV] END batch_size=48, learning_rate_init=0.02, max_iter=50, random_state=42; total time=  12.0s
/home/compsci/.local/lib/python3.8/site-packages/sklearn/neural_network/_multilayer_perceptron.py:684: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (75) reached and the optimization hasn't converged yet.
  warnings.warn(
[CV] END batch_size=48, learning_rate_init=0.02, max_iter=75, random_state=42; total time=  15.7s
[CV] END batch_size=48, learning_rate_init=0.05, max_iter=50, random_state=42; total time=   7.7s
[CV] END batch_size=48, learning_rate_init=0.05, max_iter=75, random_state=42; total time=   9.3s
/home/compsci/.local/lib/python3.8/site-packages/sklearn/neural_network/_multilayer_perceptron.py:684: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (50) reached and the optimization hasn't converged yet.
  warnings.warn(
[CV] END batch_size=64, learning_rate_init=0.01, max_iter=50, random_state=42; total time=  14.2s
/home/compsci/.local/lib/python3.8/site-packages/sklearn/neural_network/_multilayer_perceptron.py:684: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (75) reached and the optimization hasn't converged yet.
  warnings.warn(
[CV] END batch_size=64, learning_rate_init=0.01, max_iter=75, random_state=42; total time=  17.8s
/home/compsci/.local/lib/python3.8/site-packages/sklearn/neural_network/_multilayer_perceptron.py:684: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (50) reached and the optimization hasn't converged yet.
  warnings.warn(
[CV] END batch_size=64, learning_rate_init=0.02, max_iter=50, random_state=42; total time=   6.9s
[CV] END batch_size=64, learning_rate_init=0.02, max_iter=75, random_state=42; total time=  10.4s
[CV] END batch_size=64, learning_rate_init=0.05, max_iter=50, random_state=42; total time=   7.1s
[CV] END batch_size=64, learning_rate_init=0.05, max_iter=75, random_state=42; total time=   7.2s
/home/compsci/.local/lib/python3.8/site-packages/sklearn/neural_network/_multilayer_perceptron.py:684: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (50) reached and the optimization hasn't converged yet.
  warnings.warn(
[CV] END batch_size=72, learning_rate_init=0.01, max_iter=50, random_state=42; total time=   6.0s
/home/compsci/.local/lib/python3.8/site-packages/sklearn/neural_network/_multilayer_perceptron.py:684: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (75) reached and the optimization hasn't converged yet.
  warnings.warn(
[CV] END batch_size=72, learning_rate_init=0.01, max_iter=75, random_state=42; total time=   9.7s
/home/compsci/.local/lib/python3.8/site-packages/sklearn/neural_network/_multilayer_perceptron.py:684: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (50) reached and the optimization hasn't converged yet.
  warnings.warn(
[CV] END batch_size=72, learning_rate_init=0.02, max_iter=50, random_state=42; total time=   6.4s
[CV] END batch_size=72, learning_rate_init=0.02, max_iter=75, random_state=42; total time=   7.2s
[CV] END batch_size=72, learning_rate_init=0.05, max_iter=50, random_state=42; total time=   7.3s
[CV] END batch_size=72, learning_rate_init=0.05, max_iter=75, random_state=42; total time=   7.4s
/home/compsci/.local/lib/python3.8/site-packages/sklearn/neural_network/_multilayer_perceptron.py:684: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (50) reached and the optimization hasn't converged yet.
  warnings.warn(
Out[97]:
GridSearchCV(cv=ShuffleSplit(n_splits=1, random_state=42, test_size=0.3, train_size=None),
             estimator=MLPClassifier(),
             param_grid={'batch_size': [48, 64, 72],
                         'learning_rate_init': [0.01, 0.02, 0.05],
                         'max_iter': [50, 75], 'random_state': [42]},
             verbose=2)
In [102]:
df_res = pd.DataFrame(gs.cv_results_)
columns = ['param_batch_size', 'param_learning_rate_init', 'param_max_iter', 'split0_test_score']
df_res.sort_values(columns[-1])[::-1][columns]
Out[102]:
param_batch_size param_learning_rate_init param_max_iter split0_test_score
6 64 0.01 50 0.969249
7 64 0.01 75 0.965632
12 72 0.01 50 0.965330
0 48 0.01 50 0.965029
15 72 0.02 75 0.963521
13 72 0.01 75 0.963521
3 48 0.02 75 0.962918
2 48 0.02 50 0.961411
1 48 0.01 75 0.960808
8 64 0.02 50 0.958396
9 64 0.02 75 0.957190
10 64 0.05 50 0.952970
11 64 0.05 75 0.952970
14 72 0.02 50 0.952367
16 72 0.05 50 0.949955
17 72 0.05 75 0.949955
5 48 0.05 75 0.942116
4 48 0.05 50 0.942116
In [105]:
# Bernold's solution
cl = MLPClassifier(learning_rate_init=0.01, hidden_layer_sizes=(100, 100, 30), random_state=42)
evaluate(cl, X, y)
Out[105]:
0.9701537533916189
In [106]:
cl = MLPClassifier(
    learning_rate_init=0.01, hidden_layer_sizes=(100, 100, 30),
    random_state=42, solver='sgd'
)
evaluate(cl, X, y)
Out[106]:
0.963521254145312
In [107]:
cl = MLPClassifier(
    learning_rate_init=0.01, hidden_layer_sizes=(100, 100, 30),
    random_state=42, solver='sgd', learning_rate='adaptive',
)
evaluate(cl, X, y)
/home/compsci/.local/lib/python3.8/site-packages/sklearn/neural_network/_multilayer_perceptron.py:684: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet.
  warnings.warn(
Out[107]:
0.9656315948145915
In [108]:
cl = MLPClassifier(
    learning_rate_init=0.02, hidden_layer_sizes=(100, 100, 30),
    random_state=42, solver='sgd', learning_rate='adaptive',
)
evaluate(cl, X, y)
Out[108]:
0.9674404582454025
In [109]:
cl = MLPClassifier(
    learning_rate_init=0.04, hidden_layer_sizes=(100, 100, 30),
    random_state=42, solver='sgd', learning_rate='adaptive',
)
evaluate(cl, X, y)
/home/compsci/.local/lib/python3.8/site-packages/sklearn/neural_network/_multilayer_perceptron.py:684: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet.
  warnings.warn(
Out[109]:
0.9686463671992764
In [110]:
cl = MLPClassifier(
    learning_rate_init=0.08, hidden_layer_sizes=(100, 100, 30),
    random_state=42, solver='sgd', learning_rate='adaptive',
)
evaluate(cl, X, y)
/home/compsci/.local/lib/python3.8/site-packages/sklearn/neural_network/_multilayer_perceptron.py:684: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet.
  warnings.warn(
Out[110]:
0.966536026529997

Data Visualization Methods¶

PCA¶

  • Principal Component Analysis (PCA) is a standard method for dimensionality reduction and data visualization.
  • The goal is to obtain lower-dimensional data while preserving as much of the data's variance as possible.
  • If only 2 principal components are used, then the data points can be visualized in a scatter plot.

Theory¶

  • input matrix: $X \in \mathbb{R}^{n \times d}$
  • eigenvalue decomposition of the "covariance" matrix of the column-centered data: $X^T X = W \Lambda W^T$,
    where $\Lambda$ is the diagonal matrix of eigenvalues and the columns of $W$ contain the associated eigenvectors
  • the eigenvalues are sorted in descending order
  • dimensionality reduction: $Y = X W_K$, where $K < d$ and $W_K$ contains the first $K$ columns of $W$

Exercise 1: Visualize the Wisconsin Breast Cancer data set using PCA!

In [1]:
# Load the data to DataFrame.
import pandas as pd
names = [
    'Sample_code_number', 'Clump_Thickness', 'Uniformity_of_Cell_Size',
    'Uniformity_of_Cell_Shape', 'Marginal_Adhesion', 'Single_Epithelial_Cell_Size',
    'Bare_Nuclei', 'Bland_Chromatin', 'Normal_Nucleoli', 'Mitoses', 'Class'
]
df = pd.read_csv('../_data/wisconsin_data.txt', sep=',', names=names, na_values='?')
df['Bare_Nuclei'].fillna(0, inplace=True)
df.head()
Out[1]:
Sample_code_number Clump_Thickness Uniformity_of_Cell_Size Uniformity_of_Cell_Shape Marginal_Adhesion Single_Epithelial_Cell_Size Bare_Nuclei Bland_Chromatin Normal_Nucleoli Mitoses Class
0 1000025 5 1 1 1 2 1.0 3 1 1 2
1 1002945 5 4 4 5 7 10.0 3 2 1 2
2 1015425 3 1 1 1 2 2.0 3 1 1 2
3 1016277 6 8 8 1 3 4.0 3 7 1 2
4 1017023 4 1 1 3 2 1.0 3 1 1 2
In [2]:
X = df[df.columns[1:-1]].values   # input matrix
y = (df['Class'] // 2 - 1).values # target vector
X.shape
Out[2]:
(699, 9)
In [3]:
# Scikit-Learn based implementation.
from sklearn.decomposition import PCA
pca = PCA(2)
pca.fit(X)           # centers X, then computes the eigenvalue decomposition of X^T * X
Z = pca.transform(X) # computes X * W_K
In [4]:
# fit(), then transform()
Z = pca.fit_transform(X)
In [5]:
Z
Out[5]:
array([[-4.40904732,  0.01994907],
       [ 4.88110465, -4.90332617],
       [-4.56415067, -0.65506726],
       ...,
       [10.33687644,  7.24412961],
       [ 6.47179821,  2.54870704],
       [ 7.56560301,  1.23300626]])
In [6]:
# NumPy based implementation.
import numpy as np
X2 = X - X.mean(axis=0)         # subtract column means 
L, W = np.linalg.eig(X2.T @ X2) # eigenvalue decomposition
X2 @ W[:, :2]
Out[6]:
array([[  4.40904732,  -0.01994907],
       [ -4.88110465,   4.90332617],
       [  4.56415067,   0.65506726],
       ...,
       [-10.33687644,  -7.24412961],
       [ -6.47179821,  -2.54870704],
       [ -7.56560301,  -1.23300626]])
In [7]:
# Visualization.
import matplotlib.pyplot as plt

def visualize(Z, y):
    cond = (y == 0)
    plt.figure(figsize=(8, 8))
    plt.scatter(Z[cond, 0], Z[cond, 1], alpha=0.6)
    plt.scatter(Z[~cond, 0], Z[~cond, 1], alpha=0.6)
    plt.legend(['benign', 'malignant'])
    
visualize(Z, y)

Exercise 2: Measure the cross-validation accuracy of gradient boosting after reducing the dimension to 1, 2, ..., 9 using PCA.

In [8]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score

def evaluate(cl, X, y):
    cv = KFold(5, shuffle=True, random_state=42)
    scores = []
    for tr, te in cv.split(X):
        cl.fit(X[tr], y[tr])
        yhat = cl.predict(X)
        scores.append(accuracy_score(y[te], yhat[te]))
    return np.mean(scores)

evaluate(GradientBoostingClassifier(), X, y)
Out[8]:
0.9556423432682426
In [9]:
evaluate(GradientBoostingClassifier(), X - X.mean(axis=0), y)
Out[9]:
0.9542137718396712
In [10]:
for n_components in range(1, X.shape[1] + 1):
    X2 = PCA(n_components).fit_transform(X)
    acc = evaluate(GradientBoostingClassifier(), X2, y)
    print(n_components, acc)
1 0.9627852004110998
2 0.9570503597122302
3 0.9613360739979445
4 0.9656526207605344
5 0.9642240493319629
6 0.9642240493319629
7 0.9642240493319629
8 0.9599280575539568
9 0.9627852004110997

t-SNE¶

  • t-Distributed Stochastic Neighbor Embedding (t-SNE) can be viewed as a modernized version of PCA.
  • PCA is global in nature: it tries to preserve the distances between all pairs of data points. This often leads to visualizations where the points form an elliptical blob.
  • The key idea of t-SNE is to preserve only the local relationships between data points. If two data points are close in the original space, then they should be close in the transformed space too. Large distances in the original space, however, are not preserved in the transformed space.
  • A detailed blog post about t-SNE and PCA can be read here.

Theory¶

  • data points (rows of the input matrix): $x_1, \dots, x_n \in \mathbb{R}^d$
  • conditional probabilities: $p_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / (2\sigma_i^2))}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / (2\sigma_i^2))}$, $p_{i|i} = 0$
  • probabilities: $p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}$
  • each $\sigma_i$ is set using the bisection method so that the perplexity of the conditional distribution $p_{\cdot|i}$ equals a predefined value
  • transformed data points: $y_1, \dots, y_n \in \mathbb{R}^K$
  • transformed probabilities: $q_{ij} = \frac{\left(1 + \|y_i - y_j\|^2\right)^{-1}}{\sum_k \sum_{l \neq k} \left(1 + \|y_k - y_l\|^2\right)^{-1}}$
  • Kullback–Leibler divergence: $KL(y_1, \dots, y_n) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$
  • the KL divergence is minimized using (approximate) gradient descent
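The bisection step above can be illustrated on toy data. The sketch below (not scikit-learn's actual implementation; the toy matrix and target value are arbitrary) computes the perplexity $2^{H(p_{\cdot|i})}$ of one point's conditional distribution and tunes $\sigma_i$ until it hits the target:

```python
import numpy as np

rng = np.random.RandomState(0)
X_toy = rng.randn(50, 5)
i = 0
d2 = np.sum((X_toy - X_toy[i])**2, axis=1)  # squared distances to x_i

def perplexity(sigma):
    p = np.exp(-d2 / (2 * sigma**2))
    p[i] = 0.0                                 # p_{i|i} = 0
    p = p / p.sum()
    h = -np.sum(p[p > 0] * np.log2(p[p > 0]))  # entropy in bits
    return 2.0**h

# bisection: larger sigma => flatter distribution => higher perplexity
target, lo, hi = 10.0, 1e-3, 1e3
for _ in range(100):
    mid = (lo + hi) / 2
    if perplexity(mid) < target:
        lo = mid
    else:
        hi = mid
sigma_i = (lo + hi) / 2
print(perplexity(sigma_i))  # ~= target
```

Because perplexity grows monotonically with $\sigma$, bisection converges quickly; t-SNE repeats this search once per data point.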

Exercise 3: Prepare the t-SNE visualization of the Wisconsin Breast Cancer data set.

In [11]:
from sklearn.manifold import TSNE
tsne = TSNE(2)
Z = tsne.fit_transform(X)
visualize(Z, y)
In [12]:
# Try different perplexity values!
visualize(TSNE(2, perplexity=60).fit_transform(X), y)
In [13]:
visualize(TSNE(2, perplexity=15).fit_transform(X), y)
In [14]:
visualize(TSNE(2, perplexity=7.5).fit_transform(X), y)
In [15]:
# Exercise: Create a t-SNE visualization of the Bike Sharing data set!
fname = '../_data/bike_sharing_data.txt'
df = pd.read_csv(fname)
df
Out[15]:
instant dteday season yr mnth hr holiday weekday workingday weathersit temp atemp hum windspeed casual registered cnt
0 1 2011-01-01 1 0 1 0 0 6 0 1 0.24 0.2879 0.81 0.0000 3 13 16
1 2 2011-01-01 1 0 1 1 0 6 0 1 0.22 0.2727 0.80 0.0000 8 32 40
2 3 2011-01-01 1 0 1 2 0 6 0 1 0.22 0.2727 0.80 0.0000 5 27 32
3 4 2011-01-01 1 0 1 3 0 6 0 1 0.24 0.2879 0.75 0.0000 3 10 13
4 5 2011-01-01 1 0 1 4 0 6 0 1 0.24 0.2879 0.75 0.0000 0 1 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
17374 17375 2012-12-31 1 1 12 19 0 1 1 2 0.26 0.2576 0.60 0.1642 11 108 119
17375 17376 2012-12-31 1 1 12 20 0 1 1 2 0.26 0.2576 0.60 0.1642 8 81 89
17376 17377 2012-12-31 1 1 12 21 0 1 1 1 0.26 0.2576 0.60 0.1642 7 83 90
17377 17378 2012-12-31 1 1 12 22 0 1 1 1 0.26 0.2727 0.56 0.1343 13 48 61
17378 17379 2012-12-31 1 1 12 23 0 1 1 1 0.26 0.2727 0.65 0.1343 12 37 49

17379 rows × 17 columns

In [16]:
# Extract input matrix.
columns = [
    'season', 'yr', 'mnth', 'hr', 'holiday',
    'weekday', 'workingday', 'weathersit', 'temp', 'atemp',
    'hum', 'windspeed'
]

df[columns].describe().T
Out[16]:
count mean std min 25% 50% 75% max
season 17379.0 2.501640 1.106918 1.00 2.0000 3.0000 3.0000 4.0000
yr 17379.0 0.502561 0.500008 0.00 0.0000 1.0000 1.0000 1.0000
mnth 17379.0 6.537775 3.438776 1.00 4.0000 7.0000 10.0000 12.0000
hr 17379.0 11.546752 6.914405 0.00 6.0000 12.0000 18.0000 23.0000
holiday 17379.0 0.028770 0.167165 0.00 0.0000 0.0000 0.0000 1.0000
weekday 17379.0 3.003683 2.005771 0.00 1.0000 3.0000 5.0000 6.0000
workingday 17379.0 0.682721 0.465431 0.00 0.0000 1.0000 1.0000 1.0000
weathersit 17379.0 1.425283 0.639357 1.00 1.0000 1.0000 2.0000 4.0000
temp 17379.0 0.496987 0.192556 0.02 0.3400 0.5000 0.6600 1.0000
atemp 17379.0 0.475775 0.171850 0.00 0.3333 0.4848 0.6212 1.0000
hum 17379.0 0.627229 0.192930 0.00 0.4800 0.6300 0.7800 1.0000
windspeed 17379.0 0.190098 0.122340 0.00 0.1045 0.1940 0.2537 0.8507
In [17]:
X = df[columns].values
X /= X.std(axis=0)
In [19]:
X.shape
Out[19]:
(17379, 12)
In [33]:
def visualize2(Z, y):
    plt.figure(figsize=(8, 8))
    plt.scatter(Z[:, 0], Z[:, 1], c=y, alpha=0.6)
    plt.colorbar()
In [31]:
n = 2000
Z = TSNE(2).fit_transform(X[:n])
In [34]:
visualize2(Z, df['cnt'].values[:n])

Evaluation Metrics for Binary Classification¶

A classifier's behavior can be analyzed in detail using a confusion matrix. It requires both the predicted and true labels of a dataset and shows the number of instances where the true label is A and the predicted label is B, for all combinations of (A, B). The rows of the matrix represent true labels, while the columns represent predicted labels. For binary classification, the confusion matrix is a 2Ɨ2 table that includes TN (true negatives), FP (false positives), FN (false negatives), and TP (true positives).
                 Predicted Negative  Predicted Positive
Actual Negative  TN                  FP
Actual Positive  FN                  TP

Exercise 1: The file ctg_data.txt contains data about cardiotocographic examinations. See ctg_names.txt for more details. Build a classifier that estimates whether the value of the NSP column is 1 (meaning "normal") or not! Use a 70%-30% train-test split for evaluation! Measure the accuracy, balanced accuracy, precision, recall, and F1-score of the trained model!

In [1]:
# Load data.
import pandas as pd
df = pd.read_csv('../_data/ctg_data.txt')
df = df.sample(len(df), random_state=42) # permute rows
df
Out[1]:
LB AC FM UC DL DS DP ASTV MSTV ALTV ... Min Max Nmax Nzeros Mode Mean Median Variance Tendency NSP
282 133 0.002 0.010 0.003 0.002 0.0 0.000 46 1.1 0 ... 95 164 5 0 139 135 138 9 0 1
1999 125 0.000 0.001 0.009 0.008 0.0 0.000 62 1.7 0 ... 68 140 5 0 130 116 125 29 1 1
1709 131 0.004 0.003 0.004 0.005 0.0 0.001 60 2.1 0 ... 78 168 8 0 133 127 132 21 0 1
988 131 0.011 0.000 0.005 0.000 0.0 0.000 29 1.3 0 ... 82 171 8 0 143 145 145 9 1 1
2018 125 0.000 0.000 0.008 0.007 0.0 0.001 64 1.3 0 ... 78 155 4 0 114 111 114 7 0 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1638 130 0.009 0.001 0.004 0.001 0.0 0.000 52 1.3 0 ... 73 172 6 0 144 141 144 16 1 1
1095 123 0.012 0.000 0.002 0.000 0.0 0.000 22 2.2 0 ... 100 152 2 0 131 132 133 4 0 1
1130 122 0.005 0.000 0.004 0.005 0.0 0.000 20 2.6 0 ... 60 158 6 0 131 121 126 31 0 1
1294 115 0.003 0.000 0.008 0.002 0.0 0.001 24 1.6 0 ... 71 179 3 2 133 122 129 45 0 1
860 142 0.001 0.000 0.004 0.000 0.0 0.000 39 0.9 0 ... 127 159 1 0 151 147 149 4 1 1

2126 rows × 22 columns

In [2]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2126 entries, 282 to 860
Data columns (total 22 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   LB        2126 non-null   int64  
 1   AC        2126 non-null   float64
 2   FM        2126 non-null   float64
 3   UC        2126 non-null   float64
 4   DL        2126 non-null   float64
 5   DS        2126 non-null   float64
 6   DP        2126 non-null   float64
 7   ASTV      2126 non-null   int64  
 8   MSTV      2126 non-null   float64
 9   ALTV      2126 non-null   int64  
 10  MLTV      2126 non-null   float64
 11  Width     2126 non-null   int64  
 12  Min       2126 non-null   int64  
 13  Max       2126 non-null   int64  
 14  Nmax      2126 non-null   int64  
 15  Nzeros    2126 non-null   int64  
 16  Mode      2126 non-null   int64  
 17  Mean      2126 non-null   int64  
 18  Median    2126 non-null   int64  
 19  Variance  2126 non-null   int64  
 20  Tendency  2126 non-null   int64  
 21  NSP       2126 non-null   int64  
dtypes: float64(8), int64(14)
memory usage: 382.0 KB
In [3]:
df.describe().T
Out[3]:
count mean std min 25% 50% 75% max
LB 2126.0 133.303857 9.840844 106.0 126.000 133.000 140.000 160.000
AC 2126.0 0.003178 0.003866 0.0 0.000 0.002 0.006 0.019
FM 2126.0 0.009481 0.046666 0.0 0.000 0.000 0.003 0.481
UC 2126.0 0.004366 0.002946 0.0 0.002 0.004 0.007 0.015
DL 2126.0 0.001889 0.002960 0.0 0.000 0.000 0.003 0.015
DS 2126.0 0.000003 0.000057 0.0 0.000 0.000 0.000 0.001
DP 2126.0 0.000159 0.000590 0.0 0.000 0.000 0.000 0.005
ASTV 2126.0 46.990122 17.192814 12.0 32.000 49.000 61.000 87.000
MSTV 2126.0 1.332785 0.883241 0.2 0.700 1.200 1.700 7.000
ALTV 2126.0 9.846660 18.396880 0.0 0.000 0.000 11.000 91.000
MLTV 2126.0 8.187629 5.628247 0.0 4.600 7.400 10.800 50.700
Width 2126.0 70.445908 38.955693 3.0 37.000 67.500 100.000 180.000
Min 2126.0 93.579492 29.560212 50.0 67.000 93.000 120.000 159.000
Max 2126.0 164.025400 17.944183 122.0 152.000 162.000 174.000 238.000
Nmax 2126.0 4.068203 2.949386 0.0 2.000 3.000 6.000 18.000
Nzeros 2126.0 0.323612 0.706059 0.0 0.000 0.000 0.000 10.000
Mode 2126.0 137.452023 16.381289 60.0 129.000 139.000 148.000 187.000
Mean 2126.0 134.610536 15.593596 73.0 125.000 136.000 145.000 182.000
Median 2126.0 138.090310 14.466589 77.0 129.000 139.000 148.000 186.000
Variance 2126.0 18.808090 28.977636 0.0 2.000 7.000 24.000 269.000
Tendency 2126.0 0.320320 0.610829 -1.0 0.000 0.000 1.000 1.000
NSP 2126.0 1.304327 0.614377 1.0 1.000 1.000 1.000 3.000
In [4]:
X = df[df.columns[:-1]].values    # input matrix
y = (df['NSP'] > 1).astype('int') # target vector (NSP:1 => 0, NSP:{2,3} => 1)

# standardize input matrix
from sklearn.preprocessing import StandardScaler
X = StandardScaler().fit_transform(X)

X.shape, y.shape, y.mean()
Out[4]:
((2126, 21), (2126,), 0.2215428033866416)
In [5]:
# generate train-test split
# from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import StratifiedShuffleSplit

def gen_split(X, y):
    return next(StratifiedShuffleSplit(test_size=0.3, random_state=42).split(X, y))

tr, te = gen_split(X, y)
tr.shape, te.shape
Out[5]:
((1488,), (638,))
In [6]:
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, balanced_accuracy_score, precision_score, recall_score, f1_score

def evaluate(cl, X, y):
    tr, te = gen_split(X, y)
    cl.fit(X[tr], y[tr])
    yhat = cl.predict(X)
    return {
        'acc': accuracy_score(y[te], yhat[te]),
        'bal_acc': balanced_accuracy_score(y[te], yhat[te]),
        'prec': precision_score(y[te], yhat[te]),
        'rec': recall_score(y[te], yhat[te]),
        'f1': f1_score(y[te], yhat[te])
    }
In [7]:
evaluate(DummyClassifier(), X, y)
/home/compsci/.local/lib/python3.8/site-packages/sklearn/metrics/_classification.py:1344: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
Out[7]:
{'acc': 0.7789968652037618, 'bal_acc': 0.5, 'prec': 0.0, 'rec': 0.0, 'f1': 0.0}
In [8]:
evaluate(LogisticRegression(C=1, class_weight='balanced'), X, y)
Out[8]:
{'acc': 0.5235109717868338,
 'bal_acc': 0.47825962869415073,
 'prec': 0.20363636363636364,
 'rec': 0.3971631205673759,
 'f1': 0.2692307692307693}
In [9]:
evaluate(GradientBoostingClassifier(max_depth=4, random_state=42), X, y)
Out[9]:
{'acc': 0.7586206896551724,
 'bal_acc': 0.5097820968363372,
 'prec': 0.2903225806451613,
 'rec': 0.06382978723404255,
 'f1': 0.10465116279069768}
  • The quality of the models is quite poor.
  • However, we will still use them for learning purposes.

Exercise 2: Prepare the confusion matrix of the trained model! Compute accuracy, balanced accuracy, precision, recall and F1-score from the elements of the confusion matrix!

In [10]:
from sklearn.metrics import confusion_matrix

cl = GradientBoostingClassifier(max_depth=4, random_state=42)
cl.fit(X[tr], y[tr])
yhat = cl.predict(X)
(tn, fp), (fn, tp) = confusion_matrix(y[te], yhat[te])
In [11]:
p = tp + fn # total number of actual positives
n = tn + fp # total number of actual negatives
print('accuracy:', (tp + tn) / (p + n))
print('balanced accuracy:', ((tp / p) + (tn / n)) / 2)
prec = tp / (tp + fp)
print('precision:', prec)
rec = tp / p
print('recall:', rec)
print('f1:', 2 / (1 / prec + 1 / rec))
accuracy: 0.7586206896551724
balanced accuracy: 0.5097820968363372
precision: 0.2903225806451613
recall: 0.06382978723404255
f1: 0.10465116279069768
  • raising the decision threshold: decreases TPR and FPR
  • lowering the decision threshold: increases TPR and FPR

Exercise 3: Optimize the decision threshold of the trained model in terms of the test F1-score.

In [12]:
import numpy as np

p = cl.predict_proba(X)[:, 1]

data = []
for th in np.arange(0, 1.05, 0.05):
    yhat = (p > th) # probability prediction => label prediction
    data.append({
        'th': th,
        'prec': precision_score(y[te], yhat[te], zero_division=0),
        'rec': recall_score(y[te], yhat[te]),
        'f1': f1_score(y[te], yhat[te])
    })
df2 = pd.DataFrame(data).set_index('th')
df2.plot()
Out[12]:
<AxesSubplot: xlabel='th'>
In [13]:
# optimal threshold
df2['f1'].idxmax()
Out[13]:
0.1
The receiver operating characteristic (ROC) curve is a graphical representation of the performance of a binary classification model. It plots the true positive rate (TPR) against the false positive rate (FPR) at different decision thresholds of the classifier. The ROC curve helps evaluate the trade-off between the classifier's sensitivity (ability to detect positive instances) and specificity (ability to avoid false positives). The area under the ROC curve (AUC) is a commonly used metric for evaluating a binary classifier. It typically ranges from 0.5 (random guessing) to 1.0 (perfect ranking); values below 0.5 indicate a ranking that is worse than random.
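A quick illustration of the AUC (on synthetic scores, not the CTG model): it coincides with the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(42)
y_demo = np.array([0] * 500 + [1] * 500)
scores = np.concatenate([rng.normal(0, 1, 500),   # negatives
                         rng.normal(1, 1, 500)])  # positives score higher on average

auc = roc_auc_score(y_demo, scores)

# rank interpretation: fraction of (positive, negative) pairs ranked correctly
pos, neg = scores[y_demo == 1], scores[y_demo == 0]
pairwise = (pos[:, None] > neg[None, :]).mean()
print(auc, pairwise)  # the two values coincide
```

This rank-based view explains why the AUC is insensitive to the decision threshold: it depends only on the ordering of the scores.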

Exercise 4: Display the ROC curve corresponding to the trained model! Compute the area under the ROC curve!

In [14]:
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

fpr, tpr, th = roc_curve(y[te], p[te])
plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], '--')
plt.xlabel('false positive rate')
plt.ylabel('true positive rate')
Out[14]:
Text(0, 0.5, 'true positive rate')
In [15]:
# Train a classifier on the Wisconsin Breast cancer data set,
# and draw its ROC curve!

# Column names.
names = [
    'Sample_code_number', 'Clump_Thickness', 'Uniformity_of_Cell_Size', 'Uniformity_of_Cell_Shape',
    'Marginal_Adhesion', 'Single_Epithelial_Cell_Size', 'Bare_Nuclei', 'Bland_Chromatin',
    'Normal_Nucleoli', 'Mitoses', 'Class'
]
df = pd.read_csv('../_data/wisconsin_data.txt', sep=',', names=names, na_values='?')
df['Bare_Nuclei'].fillna(df['Bare_Nuclei'].mean(), inplace=True)

X = df[names[1:-1]].values      # input matrix
y = df['Class'].values // 2 - 1 # target vector

tr, te = next(StratifiedShuffleSplit(test_size=0.3, random_state=42).split(X, y))
cl = GradientBoostingClassifier(random_state=42)
cl.fit(X[tr], y[tr])
p = cl.predict_proba(X)[:, 1]

fpr, tpr, th = roc_curve(y[te], p[te])
plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], '--')
plt.xlabel('false positive rate')
plt.ylabel('true positive rate')
Out[15]:
Text(0, 0.5, 'true positive rate')

Clustering algorithms¶

  • Clustering is an unsupervised learning technique.
  • The goal is to identify groups of data points that are similar to each other.
  • Types of clustering algorithms:
    • distance based
    • density based
    • ...
  • Real-world example: Find similarly behaving groups of customers for a web shop.

K-Means¶

  • K-Means is a distance based clustering algorithm.
  • Input:
    • $x_1$, ..., $x_n \in \mathbb{R}^d$ (data points)
    • $K \in \mathbb{N}$ (number of clusters)
  • Initialization: randomly generate $K$ cluster centers $c_1$, ..., $c_K$ (e.g. by randomly selecting $K$ data points).
  • Loop:
    • Assign each data point to the nearest cluster center.
    • Re-compute cluster centers as the center ("mean") of cluster members.
    • If there is no change, then stop the loop.
In [1]:
import pandas as pd

# Let's load the Wisconsin Breast Cancer data set!
names = [
    'Sample_code_number', 'Clump_Thickness', 'Uniformity_of_Cell_Size', 'Uniformity_of_Cell_Shape',
    'Marginal_Adhesion', 'Single_Epithelial_Cell_Size', 'Bare_Nuclei', 'Bland_Chromatin',
    'Normal_Nucleoli', 'Mitoses', 'Class'
]
df = pd.read_csv('../_data/wisconsin_data.txt', sep=',', names=names, na_values='?')

# Substitute the null values in the Bare_Nuclei column!
df['Bare_Nuclei'] = df['Bare_Nuclei'].fillna(df['Bare_Nuclei'].mean())

# Extract input matrix!
X = df[names[1: -1]].values
X
Out[1]:
array([[ 5.,  1.,  1., ...,  3.,  1.,  1.],
       [ 5.,  4.,  4., ...,  3.,  2.,  1.],
       [ 3.,  1.,  1., ...,  3.,  1.,  1.],
       ...,
       [ 5., 10., 10., ...,  8., 10.,  2.],
       [ 4.,  8.,  6., ..., 10.,  6.,  1.],
       [ 4.,  8.,  8., ..., 10.,  4.,  1.]])
In [2]:
X.shape
Out[2]:
(699, 9)
In [3]:
# column standard deviations
X.std(axis=0)
Out[3]:
array([2.81372582, 3.0492756 , 2.96978617, 2.85333603, 2.21271541,
       3.59927429, 2.43661945, 3.05144882, 1.7138507 ])
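The standard deviations above differ across columns, and since K-means uses Euclidean distance, columns with larger scale dominate the clustering. One option (not used in the exercises below) is to z-score the columns first, e.g. with `sklearn.preprocessing.StandardScaler`; a minimal sketch on a toy matrix:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy matrix whose two columns live on very different scales.
X_toy = np.array([[1.0, 100.0],
                  [2.0, 200.0],
                  [3.0, 300.0]])

# z-score each column: zero mean, unit standard deviation
Xs = StandardScaler().fit_transform(X_toy)
print(Xs.std(axis=0))  # [1. 1.]
```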

Exercise 1: Implement K-means clustering from scratch and test it on the Wisconsin Breast Cancer data set!

In [37]:
K = 5 # number of clusters
n_iter = 100

# randomly draw K cluster centers
import numpy as np
rs = np.random.RandomState(42)
C = X[rs.permutation(X.shape[0])[:K]]

for i in range(n_iter):
    # compute nearest center for each data point
    clusters = np.array([((C - xi)**2).sum(axis=1).argmin() for xi in X])

    # re-compute cluster centers
    for k in range(K):
        C[k] = X[clusters == k].mean(axis=0)
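The loop above always runs for a fixed `n_iter`, while the algorithm description stops once the assignments no longer change. A sketch of the same procedure with that convergence check and an empty-cluster guard added (the function name and toy data are illustrative):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=42):
    # Initialization: pick K distinct data points as centers.
    rs = np.random.RandomState(seed)
    C = X[rs.permutation(X.shape[0])[:K]].astype(float)
    clusters = np.full(X.shape[0], -1)
    for _ in range(n_iter):
        # Assign each point to the nearest center (squared Euclidean distance).
        new = np.array([((C - xi)**2).sum(axis=1).argmin() for xi in X])
        if (new == clusters).all():
            break  # no assignment changed => converged
        clusters = new
        # Re-compute each center as the mean of its members.
        for k in range(K):
            if (clusters == k).any():  # guard against empty clusters
                C[k] = X[clusters == k].mean(axis=0)
    return clusters, C

# Two well-separated toy blobs should end up in different clusters.
X_toy = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
clusters, C = kmeans(X_toy, K=2)
print(clusters)
```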
In [38]:
clusters
Out[38]:
array([1, 4, 1, 4, 1, 3, 0, 2, 2, 1, 2, 2, 1, 2, 3, 1, 1, 1, 4, 1, 4, 3,
       1, 4, 2, 4, 1, 1, 2, 2, 1, 2, 3, 2, 1, 2, 3, 1, 4, 4, 3, 4, 3, 4,
       3, 2, 4, 2, 1, 4, 4, 0, 4, 3, 4, 4, 3, 1, 4, 1, 4, 2, 3, 1, 2, 1,
       1, 4, 3, 2, 1, 3, 2, 4, 4, 2, 2, 1, 0, 2, 2, 1, 1, 1, 3, 3, 4, 3,
       1, 2, 2, 1, 1, 2, 2, 2, 2, 1, 3, 3, 4, 0, 1, 1, 3, 1, 4, 3, 2, 4,
       2, 4, 4, 3, 0, 0, 1, 3, 2, 1, 2, 1, 3, 4, 3, 2, 4, 1, 4, 2, 1, 2,
       3, 1, 1, 1, 1, 1, 1, 0, 1, 2, 4, 0, 2, 0, 4, 2, 1, 3, 2, 4, 3, 1,
       2, 4, 2, 2, 0, 3, 4, 1, 1, 0, 1, 1, 3, 3, 1, 2, 1, 2, 2, 3, 4, 3,
       2, 3, 1, 4, 2, 2, 1, 3, 4, 2, 3, 3, 3, 2, 3, 3, 1, 2, 1, 1, 4, 1,
       2, 1, 3, 4, 2, 1, 2, 3, 3, 2, 2, 1, 3, 3, 2, 3, 3, 3, 2, 2, 3, 1,
       2, 3, 0, 4, 4, 2, 4, 3, 2, 3, 4, 3, 1, 4, 1, 0, 3, 3, 3, 4, 1, 1,
       2, 0, 2, 1, 3, 4, 1, 0, 2, 4, 4, 3, 3, 4, 1, 1, 1, 4, 4, 3, 3, 4,
       3, 1, 4, 4, 3, 2, 4, 1, 4, 1, 1, 0, 1, 2, 2, 4, 1, 2, 4, 4, 4, 3,
       3, 1, 4, 3, 2, 2, 3, 4, 0, 4, 4, 1, 1, 4, 3, 2, 3, 2, 4, 4, 2, 2,
       3, 0, 2, 2, 3, 2, 2, 3, 4, 3, 2, 4, 4, 0, 1, 4, 2, 1, 4, 2, 3, 4,
       4, 1, 1, 4, 4, 2, 4, 2, 2, 4, 4, 2, 2, 2, 3, 2, 1, 2, 0, 4, 1, 2,
       4, 3, 2, 1, 1, 3, 3, 4, 3, 4, 0, 0, 2, 2, 3, 3, 2, 2, 1, 2, 1, 1,
       1, 2, 2, 2, 1, 1, 2, 4, 1, 2, 2, 1, 4, 1, 2, 1, 2, 3, 1, 2, 2, 1,
       1, 1, 1, 2, 3, 1, 1, 0, 2, 2, 1, 2, 2, 1, 2, 0, 3, 1, 4, 0, 4, 2,
       1, 2, 0, 3, 1, 1, 1, 3, 1, 3, 2, 2, 2, 1, 1, 1, 4, 4, 3, 1, 1, 1,
       4, 0, 0, 2, 1, 2, 2, 1, 2, 3, 1, 1, 1, 3, 2, 1, 4, 3, 1, 1, 1, 0,
       1, 1, 1, 3, 4, 4, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 3, 3,
       1, 0, 1, 3, 4, 1, 2, 4, 1, 3, 0, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1,
       3, 0, 1, 2, 2, 1, 1, 1, 3, 3, 2, 2, 1, 4, 2, 1, 4, 4, 1, 1, 1, 1,
       1, 1, 4, 1, 2, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 3, 2, 1, 3,
       1, 2, 1, 0, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 3, 1, 1, 4, 3, 4, 3,
       1, 2, 3, 1, 1, 2, 2, 2, 1, 3, 3, 1, 1, 2, 3, 1, 3, 1, 3, 4, 4, 1,
       4, 1, 1, 1, 1, 1, 1, 2, 1, 3, 4, 3, 1, 2, 3, 1, 4, 3, 3, 2, 2, 1,
       1, 0, 1, 1, 1, 1, 1, 2, 1, 0, 3, 0, 2, 1, 1, 1, 2, 3, 1, 1, 3, 1,
       1, 1, 1, 1, 1, 2, 2, 1, 2, 2, 3, 1, 0, 2, 1, 1, 1, 1, 1, 1, 4, 2,
       2, 1, 2, 2, 1, 2, 1, 1, 3, 3, 3, 1, 2, 1, 2, 1, 2, 1, 2, 2, 3, 3,
       1, 2, 2, 2, 2, 1, 1, 2, 2, 3, 1, 1, 1, 2, 3, 3, 3])
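The raw label vector is hard to read at a glance; the cluster sizes can be summarized with `np.bincount` (shown here on the first ten labels from the output above):

```python
import numpy as np

clusters = np.array([1, 4, 1, 4, 1, 3, 0, 2, 2, 1])  # first ten labels from above
print(np.bincount(clusters))  # number of points per cluster: [1 4 2 1 2]
```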

Exercise 2: Visualize the clusters on a scatter plot of the data!

In [42]:
# apply t-SNE on the data matrix X
from sklearn.manifold import TSNE
Z = TSNE(random_state=42).fit_transform(X)
In [53]:
import matplotlib.pyplot as plt
plt.figure(figsize=(8, 8))
for k in range(K):
    plt.scatter(Z[clusters == k, 0], Z[clusters == k, 1], alpha=0.7)

Exercise 3: Repeat the previous experiment using scikit-learn's built-in KMeans class!

In [68]:
from sklearn.cluster import KMeans
km = KMeans(n_clusters=5, init='random', n_init=1, random_state=42)
clusters = km.fit(X).predict(X)

import matplotlib.pyplot as plt
plt.figure(figsize=(8, 8))
for k in range(K):
    plt.scatter(Z[clusters == k, 0], Z[clusters == k, 1], alpha=0.7)
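Because cluster IDs are arbitrary, the from-scratch labels and scikit-learn's labels cannot be compared element-wise; a permutation-invariant score such as the adjusted Rand index (`sklearn.metrics.adjusted_rand_score`) could be used instead. A minimal sketch on synthetic labelings:

```python
from sklearn.metrics import adjusted_rand_score

# The same partition with permuted cluster IDs still scores 1.0;
# ~0.0 would indicate chance-level agreement.
a = [0, 0, 1, 1, 2, 2]
b = [1, 1, 0, 0, 2, 2]
print(adjusted_rand_score(a, b))  # 1.0
```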